Abstract

We study the complexity of approximate representation and learning of submodular functions over
the uniform distribution on the Boolean hypercube $\{0,1\}^n$. Our main result is the following structural
theorem: any submodular function is $\epsilon$-close in $\ell_2$ to a real-valued decision tree (DT) of depth $O(1/\epsilon^2)$.
This immediately implies that any submodular function is $\epsilon$-close to a function of at most $2^{O(1/\epsilon^2)}$
variables and has a spectral $\ell_1$ norm of $2^{O(1/\epsilon^2)}$. It also implies the closest previous result, which states that
submodular functions can be approximated by polynomials of degree $O(1/\epsilon^2)$ (Cheraghchi et al., 2012).
Our result is proved by constructing an approximation of a submodular function by a DT of rank $4/\epsilon^2$
and a proof that any rank-$r$ DT can be $\epsilon$-approximated by a DT of depth $\frac{5}{2}(r + \log(1/\epsilon))$.

We show that these structural results can be exploited to give an attribute-efficient PAC learning
algorithm for submodular functions running in time $\tilde{O}(n^2) \cdot 2^{O(1/\epsilon^4)}$. The best previous algorithm for
the problem requires $n^{O(1/\epsilon^2)}$ time and examples (Cheraghchi et al., 2012) but works also in the agnostic
setting. In addition, we give improved learning algorithms for a number of related settings.

We also prove that our PAC and agnostic learning algorithms are essentially optimal via two lower
bounds: (1) an information-theoretic lower bound of $2^{\Omega(1/\epsilon^{2/3})}$ on the complexity of learning monotone
submodular functions in any reasonable model (including learning with value queries); (2) a computational
lower bound of $n^{\Omega(1/\epsilon^{2/3})}$ based on a reduction to learning of sparse parities with noise, widely believed
to be intractable. These are the first lower bounds for learning of submodular functions over the uniform
distribution.



1 Introduction 



We study the problem of learning submodular functions and their (approximate) representation.
Submodularity, a discrete analog of convexity, has played an essential role in combinatorial optimization
(Lovász, 1983). It appears in many important settings including cuts in graphs (Goemans and Williamson, 1995;
Queyranne, 1995; Fleischer et al., 2001), rank functions of matroids (Edmonds, 1970; Frank, 1997), set
covering problems (Feige, 1998), and plant location problems (Cornuejols et al., 1977). Recently, interest in
submodular functions has been revived by new applications in algorithmic game theory as well as machine
learning. In machine learning, several applications (Guestrin et al., 2005; Krause et al., 2006, 2008;
Krause and Guestrin, 2011) have relied on the fact that the information provided by a collection of sensors
is a submodular function. In algorithmic game theory, submodular functions have found application as
valuation functions with the property of diminishing returns (B. Lehmann and Nisan, 2006; Dobzinski et al.,
2005; Vondrák, 2008; Papadimitriou et al., 2008; Dughmi et al., 2011).

*Work done while the author was at IBM Research - Almaden.

Widespread applications of submodular functions have recently inspired the question of whether and
how such functions can be learned from random examples (of an unknown submodular function). The
question was first formally considered by Balcan and Harvey (2012), who motivate it by learning of
valuation functions. Previously, reconstruction of such functions up to some multiplicative factor from value
queries (which allow the learner to ask for the value of the function at any point) was also considered by
Goemans et al. (2009). These works have led to significant attention to several variants of the problem
of learning submodular functions (Gupta et al., 2011; Cheraghchi et al., 2012; Badanidiyuru et al., 2012;
Balcan et al., 2012; Raskhodnikova and Yaroslavtsev, 2013). We survey the prior work in more detail in
Sections 1.1 and 1.2.

In this work we consider the setting in which the learner gets random and uniform examples of an
unknown submodular function $f$ and its goal is to find a hypothesis function $h$ which $\epsilon$-approximates $f$ for a
given $\epsilon > 0$. The main measures of the approximation error we use are the standard absolute error or
$\ell_1$-distance, which equals $\mathbf{E}_{x \sim \mathcal{D}}[|f(x) - h(x)|]$, and $\ell_2$-distance, which equals
$\sqrt{\mathbf{E}_{x \sim \mathcal{D}}[(f(x) - h(x))^2]}$ (and upper-bounds the $\ell_1$-distance). This is essentially the PAC model
(Valiant, 1984) of learning applied to real-valued functions (as done for example by Haussler (1992) and
Kearns et al. (1994)). It is also closely related to learning of probabilistic concepts (which are concepts
expressing the probability of the function being 1) in which the goal is to approximate the unknown
probabilistic concept in $\ell_1$ (Kearns and Schapire, 1994). As follows from the previous work
(Balcan and Harvey, 2012), without assumptions on the distribution, learning a submodular function to a
constant $\ell_1$ error requires an exponential number of random examples. We therefore consider the problem
with the distribution restricted to be uniform, a setting widely studied in the context of learning Boolean
functions in the PAC model (e.g. Linial et al. (1993), O'Donnell and Servedio (2007)). This special case is
also the focus of several other recent works on learning submodular functions
(Gupta et al., 2011; Cheraghchi et al., 2012; Raskhodnikova and Yaroslavtsev, 2013).



1.1 Our Results 

We give three types of results on the problem of learning and approximating submodular functions over the
uniform distribution. First we show that submodular functions can be approximated by decision trees of
low rank. Then we show how such approximation can be exploited for learning. Finally, we show that our
learning results are close to the best possible.

Structural results: Our two key structural results can be summarized as follows. The first one shows that
every submodular function can be approximated by a decision tree of low rank. The rank of a decision tree
is a classic measure of complexity of decision trees introduced by Ehrenfeucht and Haussler (1989). One
way to define the rank of a decision tree $T$ (denoted by $\mathrm{rank}(T)$) is as the depth of the largest complete
binary tree that can be embedded in $T$ (see Section 2 for formal definitions).

Theorem 1.1. Let $f : \{0,1\}^n \to [0,1]$ be a submodular function and $\epsilon > 0$. There exists a real-valued
binary decision tree $T$ of rank at most $4/\epsilon^2$ that approximates $f$ within $\ell_2$-error $\epsilon$.



This result is based on a decomposition technique of Gupta et al. (2011) that shows that a submodular
function $f$ can be decomposed into disjoint regions where $f$ is also $\alpha$-Lipschitz (for some $\alpha > 0$). We prove
that this decomposition can be computed by a binary decision tree of rank $2/\alpha$. Our second result is that
over the uniform distribution a decision tree of rank $r$ can be $\epsilon$-approximated by a decision tree of depth
$O(r + \log(1/\epsilon))$.



Theorem 1.2. Let $T$ be a binary decision tree of rank $r$. Then for any integer $d > 0$, $T$ truncated at depth
$d = \frac{5}{2}(r + \log(1/\epsilon))$ gives a decision tree $T_{\le d}$ such that $\Pr_{\mathcal{U}}[T(x) \ne T_{\le d}(x)] \le \epsilon$.
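The effect of truncation can be illustrated numerically. The sketch below is not taken from the paper: it assumes a hypothetical tuple encoding `(variable, left, right)` for internal nodes, constant leaves, no variable repeated along a path, a base-2 logarithm, and the depth $\frac{5}{2}(r + \log(1/\epsilon))$ rounded up. Truncation can only change the output on inputs whose path is longer than $d$, so the probability of reaching an internal node at depth $d$ upper-bounds the disagreement.

```python
import math


def prob_reaching_depth(tree, d):
    """Probability that a uniform input reaches an internal node at depth d.

    Assumes no variable repeats on a path, so each branch has probability 1/2.
    """
    if not isinstance(tree, tuple):  # a leaf: the path has already ended
        return 0.0
    if d == 0:
        return 1.0
    _, left, right = tree
    return 0.5 * (prob_reaching_depth(left, d - 1)
                  + prob_reaching_depth(right, d - 1))


def decision_list(length):
    """A rank-1 'caterpillar' tree: every left child is a leaf."""
    tree = 1.0
    for var in reversed(range(length)):
        tree = (var, 0.0, tree)
    return tree


EPS = 0.1
RANK = 1  # rank of any decision list
DEPTH = math.ceil(2.5 * (RANK + math.log2(1 / EPS)))
TAIL = prob_reaching_depth(decision_list(40), DEPTH)
print(DEPTH, TAIL)
```

For this rank-1 tree of size 41 the truncation depth depends only on the rank and on $\epsilon$, not on the length of the list.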

It is well known (e.g. (Kushilevitz and Mansour, 1993)) that a decision tree of size $s$ (i.e. with $s$ leaves)
is $\epsilon$-close to the same decision tree pruned at depth $\log(s/\epsilon)$. It is also well known that any decision
tree of size $s$ has rank at most $\log s$. Therefore Theorem 1.2 (strictly) generalizes the size-based pruning.
Another implication of this result is that several known algorithms for learning polynomial-size DTs over
the uniform distribution (e.g. (Kushilevitz and Mansour, 1993; Gopalan et al., 2008)) can be easily shown
to also learn DTs of logarithmic rank (which might have superpolynomial size).

Combining Theorems 1.1 and 1.2 we obtain that submodular functions can be approximated by shallow
decision trees and consequently as functions depending on at most $2^{O(1/\epsilon^2)}$ variables.

Corollary 1.3. Let $f : \{0,1\}^n \to [0,1]$ be a submodular function and $\epsilon > 0$. There exists a binary decision
tree $T$ of depth $d = O(1/\epsilon^2)$ with constants in the leaves such that $\|T - f\|_2 \le \epsilon$. In particular, $T$ depends
on at most $2^{O(1/\epsilon^2)}$ variables.

We remark that it is well known that a DT of depth $d$ can be written as a polynomial of degree $d$. This
gives a simple combinatorial proof of the low-degree approximation of (Cheraghchi et al., 2012), which is
based on an analysis of the noise stability of submodular functions. In addition, in our case the polynomial
depends only on $2^{O(1/\epsilon^2)}$ variables, which is not true for the approximating polynomial constructed in
(Cheraghchi et al., 2012).

Algorithmic applications: We show that these structural results can be used to obtain a number of new
learning algorithms for submodular functions. One of the key issues in applying our approximation by a
function of few variables is detecting the $2^{O(1/\epsilon^2)}$ variables that would suffice for approximating a
submodular function given random examples alone. While for general functions this probably would not be
an efficiently solvable problem, we show that a combination of (1) approximation of submodular functions
by low-degree polynomials of low spectral (Fourier) $\ell_1$ norm (implied by the DT approximation) and (2)
the discrete concavity of submodular functions allows finding the necessary variables by looking at Fourier
coefficients of degree at most 2.

Lemma 1.4. There exists an algorithm that, given uniform random examples of values of a submodular
function $f : \{0,1\}^n \to [0,1]$, finds a set of $2^{O(1/\epsilon^2)}$ variables $J$ such that there is a function $f_J$ depending
only on the variables in $J$ and satisfying $\|f - f_J\|_2 \le \epsilon$. The algorithm runs in time $n^2 \log(n) \cdot 2^{O(1/\epsilon^2)}$
and uses $\log(n) \cdot 2^{O(1/\epsilon^2)}$ random examples.

Combining this lemma with Corollary 1.3 and using standard Fourier-based learning techniques, we
obtain the following learning result in the PAC model.

Theorem 1.5. There is an algorithm that, given uniform random examples of any submodular function
$f : \{0,1\}^n \to [0,1]$, outputs a function $h$ such that $\|f - h\|_2 \le \epsilon$. The algorithm runs in time
$\tilde{O}(n^2) \cdot 2^{O(1/\epsilon^4)}$ and uses $2^{O(1/\epsilon^4)} \log n$ examples.

In the language of approximation algorithms, we give the first efficient polynomial-time approximation
scheme (EPTAS) algorithms for the problem. We note that the best previously known algorithm for learning
of submodular functions within $\ell_1$-error $\epsilon$ runs in time $n^{O(1/\epsilon^2)}$ (Cheraghchi et al., 2012); in other
words, it is a PTAS (this algorithm works also in the agnostic setting).

We also give a faster algorithm for agnostic learning of submodular functions, provided that we have
access to value queries (returning $f(x)$ for a given point $x \in \{0,1\}^n$).



Theorem 1.6. Let $C_s$ denote the class of all submodular functions from $\{0,1\}^n$ to $[0,1]$. There is an agnostic
learning algorithm that, given access to value queries for a function $f : \{0,1\}^n \to [0,1]$, outputs a function
$h$ such that $\|f - h\|_2 \le \Delta + \epsilon$, where $\Delta = \min_{g \in C_s}\{\|f - g\|_2\}$. The algorithm runs in time $\mathrm{poly}(n, 2^{1/\epsilon^2})$
and uses $\mathrm{poly}(\log n, 2^{1/\epsilon^2})$ value queries.



This algorithm is based on an attribute-efficient version of the Kushilevitz-Mansour algorithm
(Kushilevitz and Mansour, 1993) for finding significant Fourier coefficients by Feldman (2007). We also show a
different algorithm with the same agnostic guarantee but relative to the $\ell_1$-distance (and hence incomparable).
In this case the algorithm is based on an attribute-efficient agnostic learning of decision trees which results
from agnostic boosting (Kalai and Kanade, 2009; Feldman, 2010) applied to the attribute-efficient algorithm for
learning parities (Feldman, 2007).

Finally, we discuss the special case of submodular functions with a discrete range $\{0, 1, \ldots, k\}$ studied
in a recent work of Raskhodnikova and Yaroslavtsev (2013). We show that an adaptation of our techniques
implies that such submodular functions can be exactly represented by rank-$2k$ decision trees. This directly
leads to new structural results and faster learning algorithms in this setting. A more detailed discussion
appears in a later section.

Lower bounds: We prove that an exponential dependence on $\epsilon$ is necessary for learning of submodular
functions (even monotone ones); in other words, there exists no fully polynomial-time approximation scheme
(FPTAS) for the problem.

Theorem 1.7. PAC-learning monotone submodular functions with range $[0,1]$ within $\ell_1$-error of $\epsilon > 0$
requires $2^{\Omega(\epsilon^{-2/3})}$ value queries to $f$.

Our proof shows that any function $g$ of $t$ variables can be embedded into a submodular function $f_g$ over
$2t$ variables in a way that any approximation of $f_g$ to accuracy $\theta(t^{-3/2})$ would yield a $1/4$ approximation of
$g$. The latter is well known to require $\Omega(2^t)$ random examples (or even value queries). This result implies
optimality (up to the constant in the power of $\epsilon$) of our PAC learning algorithms for submodular functions.

Further, we prove that agnostic learning of monotone submodular functions is computationally hard via 
a reduction from learning sparse parities with noise. 

Theorem 1.8. Agnostic learning of monotone submodular functions with range $[0,1]$ within $\ell_1$-error of
$\epsilon > 0$ in time $T(n, 1/\epsilon)$ would imply learning of parities of size $\epsilon^{-2/3}$ with noise of rate $\eta$ in time
$\mathrm{poly}\left(n, \frac{1}{\epsilon(1-2\eta)}\right) + 2T\left(n, \frac{c}{\epsilon(1-2\eta)}\right)$ for some fixed constant $c$.



Learning of sparse parities with noise is a well-studied open problem in learning theory closely related to
problems in coding theory and cryptography. It is known to be at least as hard as learning of DNF expressions
and juntas over the uniform distribution (Feldman et al., 2009). The trivial algorithm for learning parities on
$k$ variables from random examples corrupted by random noise of rate $\eta$ takes time $n^k \cdot \mathrm{poly}(\frac{1}{1-2\eta})$. The only
known improvement to this is an elegant algorithm of Valiant (2012) which runs in time $n^{0.8k} \cdot \mathrm{poly}(\frac{1}{1-2\eta})$.

These results suggest that agnostic learning of monotone submodular functions in time $n^{o(\epsilon^{-2/3})}$ would
require a breakthrough in our understanding of these long-standing open problems. In particular, a running
time such as $2^{\mathrm{poly}(1/\epsilon)}\mathrm{poly}(n)$, which we achieve in the PAC model, is unlikely to be achievable for agnostic
learning of submodular functions. In other words, we show that the agnostic learning algorithm of
Cheraghchi et al. (2012) is likely close to optimal. We note that this lower bound does not hold for Boolean
submodular functions. Monotone Boolean submodular functions are disjunctions and hence are agnostically
learnable in $n^{O(\log(1/\epsilon))}$ time. For further details on lower bounds we refer the reader to Section 6.



1.2 Related Work 



Below we briefly mention some of the other related work. We direct the reader to (Balcan and Harvey, 2012)
for a detailed survey. Balcan and Harvey study learning of submodular functions without assumptions on the
distribution and also require that the algorithm output a value which is within a multiplicative approximation
factor of the true value with probability $\ge 1 - \epsilon$ (the model is referred to as PMAC learning). This is a
very demanding setting and indeed one of the main results in (Balcan and Harvey, 2012) is a factor-$\sqrt[3]{n}$
inapproximability bound for submodular functions. This notion of approximation is also considered in
subsequent works (Badanidiyuru et al., 2012; Balcan et al., 2012) where upper and lower approximation
bounds are given for other related classes of functions such as XOS and subadditive. The lower bound of
Balcan and Harvey (2012) also implies hardness of learning of submodular functions with $\ell_1$ (or $\ell_2$) error:
it is impossible to learn a submodular function $f : \{0,1\}^n \to [0,1]$ in $\mathrm{poly}(n)$ time within any nontrivial
$\ell_1$ error over general distributions. We emphasize that these strong lower bounds rely on a very specific
distribution concentrated on a sparse set of points, and show that this setting is very different from the
setting of uniform/product distributions which is the focus of this paper.

For product distributions, Balcan and Harvey show that 1-Lipschitz submodular functions of minimum
nonzero value at least 1 have concentration properties implying a PMAC algorithm providing an
$O(\log\frac{1}{\epsilon})$-factor approximation except for an $\epsilon$-fraction of points, using $O(\frac{1}{\epsilon} n \log n)$ samples
(Balcan and Harvey, 2012). In our setting, we have no assumption on the minimum nonzero value, and we are
interested in the additive $\ell_1$-error rather than multiplicative approximation.



Gupta et al. (2011) show that submodular functions can be $\epsilon$-approximated by a collection of $n^{O(1/\epsilon^2)}$
$\epsilon^2$-Lipschitz submodular functions. Each $\epsilon^2$-Lipschitz submodular function can be $\epsilon$-approximated by a
constant. This leads to a learning algorithm running in time $n^{O(1/\epsilon^2)}$, which however requires value oracle
access to the target function in order to build the collection. Their decomposition is also the basis of our
approach. We remark that our algorithm can be directly translated into a faster algorithm for the private
data release which motivated the problem in (Gupta et al., 2011). However, for one of their main examples,
which is privately releasing disjunctions, one does not need the full generality of submodular functions.
Coverage functions suffice and for those even faster algorithms are now known (Cheraghchi et al., 2012;
Feldman and Kothari, 2013).



In a concurrent work, Feldman and Kothari (2013) consider learning of coverage functions. Coverage
functions are a simple subclass of submodular functions which can be characterized as non-negative
combinations of monotone disjunctions. They show that over the uniform distribution any coverage function can
be approximated by a polynomial of degree $\log(1/\epsilon)$ over $O(1/\epsilon^2)$ variables and also prove that coverage
functions can be PAC learned in fully-polynomial time (that is, with polynomial dependence on both $n$ and
$1/\epsilon$). Note that our lower bounds rule out the possibility of such algorithms for all submodular functions.
Their techniques are different from ours (aside from applications of standard Fourier representation-based
algorithms).



2 Preliminaries 



We work with Boolean functions on $\{0,1\}^n$. Let $\mathcal{U}$ denote the uniform distribution over $\{0,1\}^n$.

Submodularity A set function $f : 2^N \to \mathbb{R}$ is submodular if $f(A \cup B) + f(A \cap B) \le f(A) + f(B)$ for
all $A, B \subseteq N$. In this paper, we work with an equivalent description of set functions as functions on the
hypercube $\{0,1\}^n$.

For $x \in \{0,1\}^n$, $b \in \{0,1\}$ and $i \in [n]$, let $x_{i \leftarrow b}$ denote the vector in $\{0,1\}^n$ that equals $x$ with the $i$-th
coordinate set to $b$. For a function $f : \{0,1\}^n \to \mathbb{R}$ and index $i \in [n]$ we define
$\partial_i f(x) = f(x_{i \leftarrow 1}) - f(x_{i \leftarrow 0})$. A function $f : \{0,1\}^n \to \mathbb{R}$ is submodular iff $\partial_i f$ is a non-increasing
function for each $i \in [n]$, or equivalently, for all $i \ne j$, $\partial_{i,j} f(x) = \partial_i(\partial_j f(x)) \le 0$. A function
$f : \{0,1\}^n \to \mathbb{R}$ is $\alpha$-Lipschitz if $\partial_i f(x) \in [-\alpha, \alpha]$ for all $i \in [n]$, $x \in \{0,1\}^n$.
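These definitions can be checked by brute force on small examples. The following is a minimal sketch, not part of the paper: inputs are encoded as tuples of bits, and the function names, the tolerance `tol`, and the two example functions (a small coverage function and a product of two bits) are illustrative assumptions.

```python
from itertools import product


def set_bit(x, i, b):
    """Return a copy of the bit tuple x with coordinate i set to b."""
    return x[:i] + (b,) + x[i + 1:]


def derivative(f, i):
    """Discrete derivative: x -> f(x with x_i = 1) - f(x with x_i = 0)."""
    return lambda x: f(set_bit(x, i, 1)) - f(set_bit(x, i, 0))


def is_submodular(f, n, tol=1e-12):
    """Check that every mixed second derivative is non-positive."""
    points = list(product((0, 1), repeat=n))
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = derivative(derivative(f, j), i)
            if any(d_ij(x) > tol for x in points):
                return False
    return True


def lipschitz_constant(f, n):
    """Smallest alpha such that every discrete derivative lies in [-alpha, alpha]."""
    points = list(product((0, 1), repeat=n))
    return max(abs(derivative(f, i)(x)) for i in range(n) for x in points)


SETS = [{0, 1}, {1, 2}, {2, 3}]  # hypothetical ground sets for a coverage function


def coverage(x):
    """Fraction of the universe {0, 1, 2, 3} covered by the chosen sets."""
    covered = set()
    for chosen, s in zip(x, SETS):
        if chosen:
            covered |= s
    return len(covered) / 4


def product_fn(x):
    """x_0 * x_1, a supermodular function that is not submodular."""
    return x[0] * x[1]


print(is_submodular(coverage, 3), lipschitz_constant(coverage, 3))
print(is_submodular(product_fn, 2))
```

The check costs time exponential in $n$, so it is only meant for sanity-testing small instances.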

Absolute error vs. Error relative to norm: In our results, we typically assume that the values of $f(x)$ are
in a bounded interval $[0,1]$, and our goal is to learn $f$ with an additive error of $\epsilon$. Some prior work considered
an error relative to the norm of $f$, for example at most $\epsilon\|f\|_1$ (Cheraghchi et al., 2012). In fact, it is known
that for nonnegative submodular functions, $\|f\|_1 = \mathbf{E}[f] \ge \frac{1}{4}\|f\|_\infty$ and hence this does not make much
difference. If we scale $f(x)$ by $1/(4\|f\|_1)$, we obtain a function with values in $[0,1]$. Learning this function
within an additive error of $\epsilon$ is equivalent to learning the original function within an error of $4\epsilon\|f\|_1$.

Decision Trees: We use $x_1, x_2, \ldots, x_n$ to refer to $n$ functions on $\{0,1\}^n$ such that $x_i(x) = x_i$. Let
$X = \{x_1, x_2, \ldots, x_n\}$. We represent real-valued functions over $\{0,1\}^n$ using binary decision trees in
which each leaf can itself be any real-valued function. Specifically, a function is represented as a binary tree
$T$ in which each internal node is labeled by some variable $x \in X$ and each leaf $\ell$ is labeled by some real-valued
function $f_\ell$ over variables not restricted on the path to the leaf. We refer to a decision tree in which each leaf
is labeled by a function from some set of functions $\mathcal{F}$ as $\mathcal{F}$-valued. If $\mathcal{F}$ contains only constants from the
domain of the function then we obtain the usual decision trees.

For a decision tree $T$ with variable $x_r \in X$ at the root we denote by $T_0$ ($T_1$) the left subtree of $T$ (the
right subtree, respectively). The value of the tree on a point $x$ is computed in the standard way: if the tree
is a leaf $\ell$ then $T(x) = f_\ell(x_{X[\ell]})$, where $X[\ell]$ is the set of indices of variables which are not restricted on
the path to $\ell$ and $x_{X[\ell]}$ is the substring of $x$ containing all the coordinates in $X[\ell]$. If $T$ is not a leaf then
$T(x) = T_{x_r(x)}(x)$ where $x_r$ is the variable at the root of $T$.



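The rank of a decision tree $T$ is defined as follows (Ehrenfeucht and Haussler, 1989). If $T$ is a leaf, then
$\mathrm{rank}(T) = 0$. Otherwise:

\[
\mathrm{rank}(T) =
\begin{cases}
\max\{\mathrm{rank}(T_0), \mathrm{rank}(T_1)\} & \text{if } \mathrm{rank}(T_0) \ne \mathrm{rank}(T_1); \\
\mathrm{rank}(T_0) + 1 & \text{otherwise.}
\end{cases}
\]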
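The recursive definition translates directly into code. The sketch below is illustrative only: it assumes a hypothetical encoding in which an internal node is a tuple `(variable, left, right)` and anything else is a leaf. It also shows the two extreme cases: a complete binary tree of depth $d$ has rank $d$, while a decision list of any length has rank 1.

```python
def rank(tree):
    """Rank of a binary decision tree, following the recursive definition."""
    if not isinstance(tree, tuple):  # a leaf
        return 0
    _, left, right = tree
    r0, r1 = rank(left), rank(right)
    return max(r0, r1) if r0 != r1 else r0 + 1


def complete_tree(depth, var=0):
    """Complete binary tree that queries variable var + level at each level."""
    if depth == 0:
        return 0.0
    child = complete_tree(depth - 1, var + 1)
    return (var, child, child)


def decision_list(length):
    """A 'caterpillar' tree: every left child is a leaf."""
    tree = 1.0
    for var in reversed(range(length)):
        tree = (var, 0.0, tree)
    return tree


print(rank(complete_tree(3)), rank(decision_list(10)))
```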



The depth of a node $v$ in a tree $T$ is the length of the path from the root of $T$ to $v$. The depth of a tree is the
depth of its deepest leaf. For any node $v \in T$ we denote by $T[v]$ the sub-tree rooted at that node. We also
use $T$ to refer to the function computed by $T$.

Fourier Analysis on the Boolean Cube We define the notions of inner product and norms, which we
consider with respect to $\mathcal{U}$. For two functions $f, g : \{0,1\}^n \to \mathbb{R}$, the inner product of $f$ and $g$ is defined
as $\langle f, g \rangle = \mathbf{E}_{x \sim \mathcal{U}}[f(x) \cdot g(x)]$. The $\ell_1$ and $\ell_2$ norms of $f$ are defined by $\|f\|_1 = \mathbf{E}_{x \sim \mathcal{U}}[|f(x)|]$ and
$\|f\|_2 = (\mathbf{E}_{x \sim \mathcal{U}}[f(x)^2])^{1/2}$ respectively.

For $S \subseteq [n]$, the parity function $\chi_S : \{0,1\}^n \to \{-1,1\}$ is defined by $\chi_S(x) = (-1)^{\sum_{i \in S} x_i}$. The
parities form an orthonormal basis for functions on $\{0,1\}^n$ under the inner product with respect to
the uniform distribution. Thus, every function $f : \{0,1\}^n \to \mathbb{R}$ can be written as a real linear combination
of parities. The coefficients of the linear combination are referred to as Fourier coefficients of $f$. For
$f : \{0,1\}^n \to \mathbb{R}$ and $S \subseteq [n]$, the Fourier coefficient $\hat{f}(S)$ is given by $\hat{f}(S) = \langle f, \chi_S \rangle$. For any Fourier
coefficient $\hat{f}(S)$, $|S|$ is called the degree of the coefficient.

The Fourier expansion of $f$ is given by $f(x) = \sum_{S \subseteq [n]} \hat{f}(S)\chi_S(x)$. The degree of the highest-degree
non-zero Fourier coefficient of $f$ is referred to as the Fourier degree of $f$. Note that the Fourier degree of $f$ is exactly
the polynomial degree of $f$ when viewed over $\{-1,1\}^n$ instead of $\{0,1\}^n$ and therefore it is also equal to
the polynomial degree of $f$ over $\{0,1\}^n$. Let $f : \{0,1\}^n \to \mathbb{R}$ and let $\hat{f} : 2^{[n]} \to \mathbb{R}$ be its Fourier transform.
The spectral $\ell_1$ norm of $f$ is defined as

\[
\|\hat{f}\|_1 = \sum_{S \subseteq [n]} |\hat{f}(S)|.
\]

With $\chi_S$ defined as above, the Fourier transform of partial derivatives satisfies
$\partial_i f(x) = -2\sum_{S \ni i} \hat{f}(S)\chi_{S \setminus \{i\}}(x)$ and
$\partial_{i,j} f(x) = 4\sum_{S \ni i,j} \hat{f}(S)\chi_{S \setminus \{i,j\}}(x)$.
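For small $n$ the Fourier coefficients and the spectral $\ell_1$ norm can be computed by direct enumeration. The sketch below is illustrative and not from the paper: subsets $S$ are encoded as sorted tuples of 0-based indices, and the example function `budget` (a capped sum) is an assumption chosen because it is submodular.

```python
from itertools import combinations, product


def chi(S, x):
    """Parity chi_S(x) = (-1)^(sum of x_i for i in S)."""
    return -1 if sum(x[i] for i in S) % 2 else 1


def fourier_coefficients(f, n):
    """All 2^n coefficients f^(S) = E_x[f(x) * chi_S(x)] under the uniform distribution."""
    points = list(product((0, 1), repeat=n))
    subsets = [S for k in range(n + 1) for S in combinations(range(n), k)]
    return {S: sum(f(x) * chi(S, x) for x in points) / len(points)
            for S in subsets}


def spectral_l1_norm(coeffs):
    """Sum of absolute values of the Fourier coefficients."""
    return sum(abs(c) for c in coeffs.values())


def evaluate(coeffs, x):
    """Evaluate the Fourier expansion at the point x."""
    return sum(c * chi(S, x) for S, c in coeffs.items())


def budget(x):
    """min(x_1 + ... + x_n, 2) / 2, a submodular function with range [0, 1]."""
    return min(sum(x), 2) / 2


N = 3
COEFFS = fourier_coefficients(budget, N)
print(COEFFS[()], spectral_l1_norm(COEFFS))
```

The coefficient of the empty set is the mean of the function, and Parseval's identity (the squared coefficients sum to $\mathbf{E}[f^2]$) gives a quick consistency check on the output.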

Learning Models Our learning algorithms are in one of two standard models of learning. The first one
assumes that the learner has access to random examples of an unknown function from a known set of
functions. This model is a generalization of Valiant's PAC learning model to real-valued functions
(Valiant, 1984; Haussler, 1992).

Definition 2.1 (PAC $\ell_1$-learning). Let $\mathcal{F}$ be a class of real-valued functions on $\{0,1\}^n$ and let $\mathcal{D}$ be a
distribution on $\{0,1\}^n$. An algorithm $A$ PAC learns $\mathcal{F}$ on $\mathcal{D}$ if for every $\epsilon > 0$ and any target function
$f \in \mathcal{F}$, given access to random independent samples from $\mathcal{D}$ labeled by $f$, with probability at least $\frac{2}{3}$, $A$
returns a hypothesis $h$ such that $\mathbf{E}_{x \sim \mathcal{D}}[|f(x) - h(x)|] \le \epsilon$. $A$ is said to be proper if $h \in \mathcal{F}$.

While in general Valiant's model does not make assumptions on the distribution $\mathcal{D}$, here we only consider
the distribution-specific version of the model in which the distribution is fixed and is uniform over $\{0,1\}^n$.
The error parameter $\epsilon$ in the Boolean case measures probability of misclassification. Agnostic learning
generalizes the definition of PAC learning to scenarios where one cannot assume that the input labels are
consistent with a function from a given class (Haussler, 1992; Kearns et al., 1994) (for example as a result
of noise in the labels).

Definition 2.2 (Agnostic $\ell_1$-learning). Let $\mathcal{F}$ be a class of real-valued functions from $\{0,1\}^n$ to $[0,1]$ and
let $\mathcal{D}$ be any fixed distribution on $\{0,1\}^n$. For any function $f$, let $\mathrm{opt}(f, \mathcal{F})$ be defined as:

\[
\mathrm{opt}(f, \mathcal{F}) = \inf_{g \in \mathcal{F}} \mathbf{E}_{x \sim \mathcal{D}}[|g(x) - f(x)|].
\]

An algorithm $A$ is said to agnostically learn $\mathcal{F}$ on $\mathcal{D}$ if for every $\epsilon > 0$ and any function $f : \{0,1\}^n \to [0,1]$,
given access to random independent examples of $f$ drawn from $\mathcal{D}$, with probability at least $\frac{2}{3}$, $A$ outputs a
hypothesis $h$ such that

\[
\mathbf{E}_{x \sim \mathcal{D}}[|h(x) - f(x)|] \le \mathrm{opt}(f, \mathcal{F}) + \epsilon.
\]

The $\ell_2$ versions of these models are defined analogously.



3 Approximation of Submodular Functions by Low-Rank Decision Trees 

We now prove that any bounded submodular function can be represented as a low-rank decision tree
with $\alpha$-Lipschitz submodular functions in the leaves. Our construction follows closely the construction
of Gupta et al. (2011). They show that for every submodular $f$ there exists a decomposition of $\{0,1\}^n$ into
$n^{O(1/\alpha)}$ disjoint regions restricted to each of which $f$ is $\alpha$-Lipschitz submodular. In essence, we give a
binary decision tree representation of the decomposition from (Gupta et al., 2011) and then prove that the
decision tree has rank $O(1/\alpha)$.

Theorem 3.1. Let $f : \{0,1\}^n \to [0,1]$ be a submodular function and $\alpha > 0$. Let $\mathcal{F}_\alpha$ denote the set of
all $\alpha$-Lipschitz submodular functions with range $[0,1]$ over at most $n$ Boolean variables. Then $f$ can be
computed by an $\mathcal{F}_\alpha$-valued binary decision tree $T$ of rank $r \le 2/\alpha$.



We first prove a claim that decomposes a submodular function $f$ into regions where the discrete
derivatives of $f$ are upper-bounded by $\alpha$ everywhere: we call this property $\alpha$-monotone decreasing.

Definition 3.2. For $\alpha \in \mathbb{R}$, $f$ is $\alpha$-monotone decreasing if for all $i \in [n]$ and $x \in \{0,1\}^n$, $\partial_i f(x) \le \alpha$.

We remark that $\alpha$-Lipschitzness is equivalent to discrete derivatives being in the range $[-\alpha, \alpha]$, i.e. $f$ as
well as $-f$ being $\alpha$-monotone decreasing.

Lemma 3.3. For $\alpha > 0$ let $f : \{0,1\}^n \to [0,1]$ be a submodular function. Let $\mathcal{M}_\alpha$ denote the set of all
$\alpha$-monotone decreasing submodular functions with range $[0,1]$ over at most $n$ Boolean variables. Then $f$ can be
computed by an $\mathcal{M}_\alpha$-valued binary decision tree $T$ of rank $r \le 1/\alpha$.

Proof. The tree $T$ is constructed recursively as follows: if $n = 0$ then the function is a constant which can be
computed by a single leaf. If $f$ is $\alpha$-monotone decreasing then $T$ is equal to the leaf computing $f$. Otherwise,
if $f$ is not $\alpha$-monotone decreasing then there exist $i \in [n]$ and $z \in \{0,1\}^n$ such that $\partial_i f(z) > \alpha$. In fact,
submodularity of $f$ implies that $\partial_i f$ is monotone decreasing and, in particular, $\partial_i f(\mathbf{0}) \ge \partial_i f(z) > \alpha$. We
label the root with $x_i$ and build the trees $T_0$ and $T_1$ for $f$ restricted to points $x$ such that $x_i = 0$ and $x_i = 1$,
respectively (viewed as a function over $\{0,1\}^{n-1}$). Note that both restrictions preserve submodularity and
$\alpha$-monotonicity of $f$.

By definition, this binary tree computes $f(x)$ and its leaves are $\alpha$-monotone decreasing submodular
functions. It remains to compute the rank of $T$. For any node $v \in T$, we let $X[v] \subseteq [n]$ be the set of indices
of variables that are not set on the path to $v$, let $\bar{X}[v] = [n] \setminus X[v]$ and let $y[v] \in \{0,1\}^{\bar{X}[v]}$ denote the values
of the variables that were set. Let $\{0,1\}^{X[v]}$ be the subcube of points in $\{0,1\}^n$ that reach $v$, namely points
$x$ such that $x_{\bar{X}[v]} = y[v]$. Let $f[v](x) = T[v](x)$ be the restriction of $f$ to the subcube. Note that the vector
of all 0's in the $\{0,1\}^{X[v]}$ subcube corresponds to the point which equals $y[v]$ on coordinates in $\bar{X}[v]$ and
0 on all other coordinates. We refer to this point as $x[v]$.

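Let $M = \max_x\{f(x)\}$. We prove by induction on the depth of $T[v]$ that for any node $v \in T$,

\[
\mathrm{rank}(T[v]) \le \frac{M - f[v](\mathbf{0})}{\alpha}. \tag{1}
\]

This is obviously true if $v$ is a leaf. Now, let $v$ be an internal node with label $x_i$. Let $v_0$ and $v_1$ denote the
roots of $T[v]_0$ and $T[v]_1$, respectively. For $v_0$, $x[v_0] = x[v]$ and therefore $f[v](\mathbf{0}) = f[v_0](\mathbf{0})$. By the inductive
hypothesis, this implies that

\[
\mathrm{rank}(T[v_0]) \le \frac{M - f[v_0](\mathbf{0})}{\alpha} = \frac{M - f[v](\mathbf{0})}{\alpha}. \tag{2}
\]

We know that $\partial_i f[v](\mathbf{0}) > \alpha$. By definition, $\partial_i f[v](\mathbf{0}) = f[v](\mathbf{0}_{i \leftarrow 1}) - f[v](\mathbf{0})$. At the same time,
$f[v](\mathbf{0}_{i \leftarrow 1}) = f(x[v]_{i \leftarrow 1}) = f(x[v_1]) = f[v_1](\mathbf{0})$. Therefore, $f[v_1](\mathbf{0}) \ge f[v](\mathbf{0}) + \alpha$. By the inductive
hypothesis, this implies that

\[
\mathrm{rank}(T[v_1]) \le \frac{M - f[v_1](\mathbf{0})}{\alpha} \le \frac{M - f[v](\mathbf{0}) - \alpha}{\alpha} = \frac{M - f[v](\mathbf{0})}{\alpha} - 1. \tag{3}
\]

Combining equations (2) and (3) and using the definition of the rank we obtain that equation (1) holds for $v$.
The claim now follows since $f$ has range $[0,1]$ and thus $M \le 1$ and $f(\mathbf{0}) \ge 0$. □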
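The recursive construction in this proof can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: internal nodes are hypothetical tuples `(variable, T0, T1)`, a leaf is the dictionary of coordinates fixed on the path to it, and the target `capped_sum` is an example function. Following the proof, the test for splitting only looks at the all-zeros point of the current subcube, which suffices because submodularity makes each discrete derivative largest there.

```python
def build_tree(f, n, alpha, fixed=None):
    """Split on a variable whose derivative at the subcube's zero point exceeds alpha."""
    fixed = dict(fixed or {})
    base = tuple(fixed.get(i, 0) for i in range(n))  # the point x[v]
    for i in range(n):
        if i in fixed:
            continue
        raised = base[:i] + (1,) + base[i + 1:]
        if f(raised) - f(base) > alpha:
            return (i,
                    build_tree(f, n, alpha, {**fixed, i: 0}),
                    build_tree(f, n, alpha, {**fixed, i: 1}))
    return fixed  # leaf: f restricted to this subcube is alpha-monotone decreasing


def rank(tree):
    """Rank of the tree; dictionaries are leaves."""
    if isinstance(tree, dict):
        return 0
    _, left, right = tree
    r0, r1 = rank(left), rank(right)
    return max(r0, r1) if r0 != r1 else r0 + 1


def capped_sum(x):
    """min(x_1 + ... + x_n, 3) / 3, a monotone submodular function with range [0, 1]."""
    return min(sum(x), 3) / 3


ALPHA = 0.25
TREE = build_tree(capped_sum, 6, ALPHA)
print(rank(TREE), 1 / ALPHA)
```

On this example the tree has rank 3, within the bound $1/\alpha = 4$ of the lemma, even though it has many more than $2^3$ leaves.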



We note that for monotone functions Lemma 3.3 implies Theorem 3.1 since discrete derivatives of a
monotone function are non-negative. As in the construction in (Gupta et al., 2011), the extension to the
non-monotone case is based on observing that for any submodular function $f$, the function $\bar{f}(x) = f(\neg x)$
is also submodular, where $\neg x$ is obtained from $x$ by flipping every bit.



Proof of Theorem 3.1. We first apply Lemma 3.3 to obtain an $\mathcal{M}_\alpha$-valued decision tree $T'$ for $f$ of rank
$\le 1/\alpha$. Now let $\ell$ be any leaf of $T'$ and let $f[\ell]$ denote $f$ restricted to $\ell$. As before, let $X[\ell] \subseteq [n]$ be the
set of indices of variables that are not restricted on the path to $\ell$ and let $\{0,1\}^{X[\ell]}$ be the subcube of points
in $\{0,1\}^n$ that reach $\ell$. Let $\bar{f}[\ell](z) = f[\ell](\neg z)$, which is also submodular. We now use Lemma 3.3 to obtain
an $\mathcal{M}_\alpha$-valued decision tree $T_\ell$ for $\bar{f}[\ell]$ of rank $\le 1/\alpha$. We denote by $\neg T_\ell$ the tree computing the
function $T_\ell(\neg z)$. It is obtained from $T_\ell$ by swapping the subtrees of each node and replacing each function
$g(z)$ in a leaf with $g(\neg z)$. We replace each leaf $\ell$ of $T'$ by $\neg T_\ell$ and let $T$ be the resulting tree. To prove
the theorem we establish the following properties of $T$.

1. Correctness: we claim that $T(x)$ computes $f(x)$. To see this note that for each leaf $\ell$ of $T'$, $\neg T_\ell(z)$
computes $T_\ell(\neg z) = \bar{f}[\ell](\neg z) = f[\ell](z)$. Hence $T(x) = T'(x) = f(x)$.

2. $\alpha$-Lipschitzness of leaves: by our assumption, $f[\ell]$ is an $\alpha$-monotone decreasing function over $\{0,1\}^{X[\ell]}$
and therefore $\partial_i f[\ell](z) \le \alpha$ for all $i \in X[\ell]$ and $z \in \{0,1\}^{X[\ell]}$. This means that for all $i \in X[\ell]$
and $z \in \{0,1\}^{X[\ell]}$,

\[
\partial_i \bar{f}[\ell](z) = -\partial_i f[\ell](\neg z) \ge -\alpha. \tag{4}
\]

Further, let $\kappa$ be a leaf of $T_\ell$ computing a function $\bar{f}[\ell][\kappa]$. By Lemma 3.3, $\bar{f}[\ell][\kappa]$ is $\alpha$-monotone
decreasing. Together with equation (4) this implies that $\bar{f}[\ell][\kappa]$ is $\alpha$-Lipschitz. In $\neg T_\ell$, $\bar{f}[\ell][\kappa](z)$ is
replaced by $\bar{f}[\ell][\kappa](\neg z)$. This operation preserves $\alpha$-Lipschitzness and therefore all leaves of $T$ are
$\alpha$-Lipschitz functions.

3. Submodularity of the leaf functions: for each leaf $\ell$, $f[\ell]$ is submodular simply because it is a
restriction of $f$ to a subcube.

4. Rank: by Lemma 3.3, $\mathrm{rank}(T') \le 1/\alpha$ and for every leaf $\ell$ of $T'$, $\mathrm{rank}(\neg T_\ell) = \mathrm{rank}(T_\ell) \le 1/\alpha$.
As can be easily seen from the definition of rank, replacing each leaf of $T'$ by a tree of rank at most
$1/\alpha$ can increase the rank of the resulting tree by at most $1/\alpha$. Hence the rank of $T$ is at most $2/\alpha$.

□

3.1 Approximation of Leaves 

An important property of the decision tree representation is that it decomposes a function into disjoint
regions. This implies that approximating the function over the whole domain can be reduced to approximating
the function over individual regions with the same error parameter. Then, as in (Gupta et al., 2011), we can
use concentration properties of $\alpha$-Lipschitz submodular functions on the uniform distribution $\mathcal{U}$ over $\{0,1\}^n$
to approximate each $\alpha$-Lipschitz submodular function by a constant.

Formally we state the following lemma which allows the use of any loss function L. 

Lemma 3.4. For a set of functions $\mathcal{F}$, let $T$ be an $\mathcal{F}$-valued binary decision tree, $D$ be any distribution
over $\{0,1\}^n$ and $L : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ be any real-valued (loss) function. For each leaf $\ell \in T$, let $D[\ell]$ be the
distribution over $\{0,1\}^{X[\ell]}$ that equals $D$ conditioned on $x$ reaching $\ell$; let $g_\ell$ be a function that satisfies

\[ \mathbf{E}_{z \sim D[\ell]}[L(T[\ell](z), g_\ell(z))] \le \epsilon. \]

Let $T'$ be the tree obtained from $T$ by replacing each function in a leaf $\ell$ with the corresponding $g_\ell$. Then

\[ \mathbf{E}_{x \sim D}[L(T(x), T'(x))] \le \epsilon. \]



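Proof. For a leaf $\ell \in T$, let $y[\ell] \in \{0,1\}^{\bar{X}[\ell]}$ denote the values of the variables that were set on the path to
$\ell$. Note that the subcube $\{0,1\}^{X[\ell]}$ corresponds to the points $x \in \{0,1\}^n$ such that $x_{\bar{X}[\ell]} = y[\ell]$.

\[
\begin{aligned}
\mathbf{E}_{x \sim D}[L(T(x), T'(x))] &= \sum_{\ell \in T} \mathbf{E}_{x \sim D}\left[L(T(x), T'(x)) \mid x_{\bar{X}[\ell]} = y[\ell]\right] \cdot \Pr_{x \sim D}\left[x_{\bar{X}[\ell]} = y[\ell]\right] \\
&= \sum_{\ell \in T} \mathbf{E}_{z \sim D[\ell]}\left[L(T[\ell](z), g_\ell(z))\right] \cdot \Pr_{x \sim D}\left[x_{\bar{X}[\ell]} = y[\ell]\right] \\
&\le \epsilon \cdot \sum_{\ell \in T} \Pr_{x \sim D}\left[x_{\bar{X}[\ell]} = y[\ell]\right] = \epsilon.
\end{aligned}
\]

□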



It is known that 1-Lipschitz submodular functions satisfy strong concentration properties over the uniform
distribution $U$ over $\{0,1\}^n$ (Boucheron et al., 2000; Vondrák, 2010; Balcan and Harvey, 2012), with
standard deviation $O(\sqrt{\mathbf{E}[f]})$ and exponentially decaying tails. For our purposes we do not need the
exponential tail bounds and instead we state the following simple bound on variance.

Lemma 3.5. For any $\alpha$-Lipschitz submodular function $f : \{0,1\}^n \to \mathbb{R}_+$,

\[ \mathrm{Var}_U[f] \le 2\alpha \cdot \mathbf{E}_U[f]. \]

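Proof. By the Efron-Stein inequality (see (Boucheron et al., 2000)),

\[ \mathrm{Var}_U[f] \le \frac{1}{2} \sum_{i \in [n]} \mathbf{E}_U[(\partial_i f)^2] \le \frac{1}{2} \max_{i \in [n]} \|\partial_i f\|_\infty \cdot \sum_{i \in [n]} \mathbf{E}_U[|\partial_i f|] \le \frac{\alpha}{2} \sum_{i \in [n]} \mathbf{E}_U[|\partial_i f|]. \]

We can now use the fact that non-negative submodular functions are 2-self-bounding (Vondrák, 2010), and
hence

\[ \sum_{i \in [n]} \mathbf{E}_U[|\partial_i f|] = 2\,\mathbf{E}_{x \sim U}\Big[\sum_{i : f(x \oplus e_i) < f(x)} (f(x) - f(x \oplus e_i))\Big] \le 4\,\mathbf{E}_U[f]. \]

□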
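The bound of Lemma 3.5 can be checked by brute force on a small example. The sketch below is only an illustration, not part of the proof: the test function $f(x) = \alpha \cdot \min(|x|, c)$ and the names `check_variance_bound`, `cap` and `alpha` are assumptions made here. Such an $f$ is a concave function of the Hamming weight (hence submodular), non-negative and $\alpha$-Lipschitz.

```python
from itertools import product

def check_variance_bound(n, cap, alpha):
    """Brute-force Var_U[f] and 2*alpha*E_U[f] for f(x) = alpha * min(|x|, cap)."""
    # f is concave in the Hamming weight, so it is submodular; flipping one
    # bit changes it by at most alpha, and it never takes negative values.
    values = [alpha * min(sum(x), cap) for x in product((0, 1), repeat=n)]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return variance, 2 * alpha * mean

variance, bound = check_variance_bound(n=10, cap=4, alpha=0.1)
print(variance <= bound)
```

The enumeration is exponential in `n`, so this is only meant for sanity checks on a handful of variables.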

We can now finish the proof of Theorem 1.1.

Proof of Theorem 1.1. Let $T'$ be the $\mathcal{F}_\alpha$-valued decision tree for $f$ given by Theorem 3.1 with $\alpha = \epsilon^2/2$. For
every leaf $\ell$ we replace the function $T'[\ell]$ at that leaf by the constant $\mathbf{E}_U[T'[\ell]]$ (here the uniform distribution
is over $\{0,1\}^{X[\ell]}$) and let $T$ be the resulting tree.

Lemma 3.5 implies that for any $\epsilon^2/2$-Lipschitz submodular function $g : \{0,1\}^m \to [0,1]$, $\mathrm{Var}_U[g] =
\mathbf{E}_U[(g - \mathbf{E}_U[g])^2] \le 2 \cdot \frac{\epsilon^2}{2} \cdot \mathbf{E}_U[g] \le \epsilon^2$. For every leaf $\ell \in T'$, $T'[\ell]$ is $\epsilon^2/2$-Lipschitz and hence,

\[ \mathbf{E}_U[(T'[\ell](z) - T[\ell](z))^2] = \mathbf{E}_U[(T'[\ell](z) - \mathbf{E}_U[T'[\ell]])^2] \le \epsilon^2. \]

By Lemma 3.4 (with $L(a, b) = (a - b)^2$), we obtain that $\mathbf{E}_U[(T(x) - f(x))^2] \le \epsilon^2$. □

4 Approximation of Low-Rank Decision Trees by Shallow Decision Trees 

We show that over any constant-bounded product distribution $D$, a decision tree of rank $r$ can be $\epsilon$-approximated
by a decision tree of depth $O(r + \log(1/\epsilon))$. The approximating decision tree is simply the original tree
pruned at depth $d = O(r + \log(1/\epsilon))$.

For a vector $\mu \in [0,1]^n$ we denote by $D_\mu$ the product distribution over $\{0,1\}^n$ such that $\Pr_{D_\mu}[x_i = 1] = \mu_i$.
For $\alpha \in [0, 1/2]$ a product distribution $D_\mu$ is $\alpha$-bounded if $\mu \in [\alpha, 1 - \alpha]^n$. For a decision tree $T$
and integer $d \ge 0$ we denote by $T^{\le d}$ a decision tree in which all internal nodes at depth $d$ are replaced by a
leaf computing constant 0.

Theorem 4.1. (Theorem 1.2 restated) For a set of functions $\mathcal{F}$ let $T$ be an $\mathcal{F}$-valued decision tree of rank $r$,
and let $D_\mu$ be an $\alpha$-bounded product distribution for some $\alpha \in (0, 1/2]$. Then for any integer $d \ge 0$,

\[ \Pr_{D_\mu}[T^{\le d}(x) \ne T(x)] \le 2^{r-1} \cdot \left(1 - \frac{\alpha}{2}\right)^d. \]

In particular, for $d = \lfloor (r + \log(1/\epsilon))/\log(2/(2 - \alpha)) \rfloor$ we get that $\Pr_{D_\mu}[T^{\le d}(x) \ne T(x)] \le \epsilon$.

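Proof. Our proof is by induction on the pruning depth $d$. If $T$ is a leaf, the statement is trivial since $T^{\le d}(x) =
T(x)$ for any $d \ge 0$. For $d = 0$ and $r \ge 1$, $2^{r-1} \cdot (1 - \frac{\alpha}{2})^0 \ge 1$. We now assume that the claim is true for
all pruning depths $0, \ldots, d - 1$.

At least one of the subtrees $T_0$ and $T_1$ has rank at most $r - 1$. Assume, without loss of generality, that this is $T_0$.
Let $x_j$ be the label of the root node of $T$.

\[ \Pr_{D_\mu}[T^{\le d}(x) \ne T(x)] = (1 - \mu_j) \cdot \Pr_{D_\mu}[T_0^{\le d-1}(x) \ne T_0(x)] + \mu_j \cdot \Pr_{D_\mu}[T_1^{\le d-1}(x) \ne T_1(x)]. \]

By our inductive hypothesis,

\[ \Pr_{D_\mu}[T_0^{\le d-1}(x) \ne T_0(x)] \le 2^{r-2} \cdot \left(1 - \frac{\alpha}{2}\right)^{d-1} \]

and

\[ \Pr_{D_\mu}[T_1^{\le d-1}(x) \ne T_1(x)] \le 2^{r-1} \cdot \left(1 - \frac{\alpha}{2}\right)^{d-1}. \]

Combining these we get that

\[
\begin{aligned}
\Pr_{D_\mu}[T^{\le d}(x) \ne T(x)] &\le (1 - \mu_j) \cdot 2^{r-2} \cdot \left(1 - \frac{\alpha}{2}\right)^{d-1} + \mu_j \cdot 2^{r-1} \cdot \left(1 - \frac{\alpha}{2}\right)^{d-1} \\
&= 2^{r-1} \cdot \left(1 - \frac{\alpha}{2}\right)^{d-1} \cdot \frac{1 + \mu_j}{2} \le 2^{r-1} \cdot \left(1 - \frac{\alpha}{2}\right)^{d},
\end{aligned}
\]

where the last step uses $\mu_j \le 1 - \alpha$.

□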



For the uniform distribution we get error of at most $\epsilon$ for $d = (r + \log(1/\epsilon))/\log(4/3) \le \frac{5}{2}(r +
\log(1/\epsilon))$.
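The depth formula of Theorem 4.1 is easy to evaluate numerically. The following sketch is an illustration under two assumptions made here: logarithms are taken to base 2 (which is what makes the bound $2^{r-1}(1-\alpha/2)^d$ drop below $\epsilon$), and the function names `error_bound` and `pruning_depth` are hypothetical.

```python
import math

def error_bound(rank, depth, alpha):
    """Right-hand side of Theorem 4.1: 2^(r-1) * (1 - alpha/2)^d."""
    return 2 ** (rank - 1) * (1 - alpha / 2) ** depth

def pruning_depth(rank, eps, alpha):
    """d = floor((r + log2(1/eps)) / log2(2 / (2 - alpha))); base-2 logs are assumed."""
    return math.floor((rank + math.log2(1 / eps)) / math.log2(2 / (2 - alpha)))

# alpha = 1/2 corresponds to the uniform distribution.
d = pruning_depth(rank=3, eps=0.01, alpha=0.5)
print(d, error_bound(3, d, 0.5))
```

For the uniform distribution the computed depth stays below $\frac{5}{2}(r + \log(1/\epsilon))$, matching the remark above.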

An immediate corollary of Theorems 4.1 and 1.1 is that every submodular function can be $\epsilon$-approximated
over the uniform distribution by a binary decision tree of depth $O(1/\epsilon^2)$ (Corollary 1.3).

Kushilevitz and Mansour (1993) showed that the spectral $\ell_1$ norm of a decision tree of size $s$ is at most
$s$. Therefore we can immediately conclude that:

Corollary 4.2. Let $f : \{0,1\}^n \to [0,1]$ be a submodular function and $\epsilon > 0$. There exists a function
$p : \{0,1\}^n \to [0,1]$ such that $\|p - f\|_2 \le \epsilon$ and $\|\hat{p}\|_1 = 2^{O(1/\epsilon^2)}$.

5 Applications 

In this section, we give several applications of our structural results to the problem of learning submodular
functions.

5.1 PAC Learning

In this section we present our results on learning in the PAC model. We first show how to find $2^{O(1/\epsilon^2)}$ variables
that suffice for approximating any submodular function using random examples alone. Using a fairly
standard argument we first show that for any function $f$ that is close to a function of low polynomial degree
and low spectral $\ell_1$ norm (which is satisfied by submodular functions), variables sufficient for approximating
$f$ can be found by looking at significant Fourier coefficients of $f$.

Lemma 5.1. Let $f : \{0,1\}^n \to [0,1]$ be any function such that there exists a function $p$ of Fourier degree $d$
and spectral $\ell_1$ norm $\|\hat{p}\|_1 = L$ for which $\|f - p\|_2 \le \epsilon$. Define

\[ J = \{ i \mid \exists S : i \in S, |S| \le d \text{ and } |\hat{f}(S)| \ge \epsilon^2/L \}. \]

Then $|J| \le d \cdot L^2/\epsilon^4$ and there exists a function $p'$ of Fourier degree $d$ over variables in $J$ such that
$\|f - p'\|_2 \le 2\epsilon$.

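Proof. Let

\[ \mathcal{S} = \{ S \mid |S| \le d \text{ and } |\hat{f}(S)| \ge \epsilon^2/L \}. \]

By Parseval's identity, there are at most $L^2/\epsilon^4$ sets in $\mathcal{S}$. Clearly, $J$ is the union of all the sets in $\mathcal{S}$. Therefore,
the bound on the size of $J$ follows immediately from the fact that each set $S \in \mathcal{S}$ has size at most $d$.

Let $p'$ be the projection of $p$ to the subspace of $\{\chi_S : S \in \mathcal{S}\}$, that is $p' = \sum_{S \in \mathcal{S}} \hat{p}(S)\chi_S$. Now using
Parseval's identity we get that

\[ \|f - p'\|_2^2 = \sum_{S \subseteq [n]} (\hat{f}(S) - \hat{p'}(S))^2. \]

Now we observe that for any $S$, $|\hat{f}(S) - \hat{p}(S)| < |\hat{f}(S) - \hat{p'}(S)|$ can happen only when $S \notin \mathcal{S}$, in which
case $\hat{p'}(S) = 0$ and $|\hat{f}(S)| < \epsilon^2/L$. It also requires $|\hat{p}(S)| \le 2|\hat{f}(S)|$; hence it can happen only when $|\hat{p}(S)| < 2\epsilon^2/L$. In this case,

\[ (\hat{f}(S) - \hat{p'}(S))^2 - (\hat{f}(S) - \hat{p}(S))^2 = 2\hat{f}(S)\hat{p}(S) - (\hat{p}(S))^2 < 2\hat{f}(S)\hat{p}(S) \le 2|\hat{p}(S)| \cdot \epsilon^2/L. \]

Therefore,

\[ \|f - p'\|_2^2 - \|f - p\|_2^2 = \sum_{S} \left[(\hat{f}(S) - \hat{p'}(S))^2 - (\hat{f}(S) - \hat{p}(S))^2\right] \le \sum_{S} 2|\hat{p}(S)| \cdot \epsilon^2/L \le \frac{2\epsilon^2}{L} \cdot \|\hat{p}\|_1 = 2\epsilon^2. \]

This implies that $\|f - p'\|_2^2 \le 3\epsilon^2$. □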

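The second and crucial observation that we make is a connection between the Fourier coefficient of $\{i, j\}$
of a submodular function and the sum of squares of all Fourier coefficients that contain $\{i, j\}$.

Lemma 5.2. Let $f : \{0,1\}^n \to [0,1]$ be a submodular function and $i, j \in [n]$, $i \ne j$. Then

\[ |\hat{f}(\{i,j\})| \ge 2 \sum_{S \ni i,j} (\hat{f}(S))^2. \]

Proof.

\[ |\hat{f}(\{i,j\})| \overset{(a)}{=} \frac{1}{4}\left|\mathbf{E}_U[\partial_i \partial_j f]\right| \overset{(b)}{=} \frac{1}{4}\mathbf{E}_U[|\partial_i \partial_j f|] \overset{(c)}{\ge} \frac{1}{8}\mathbf{E}_U[(\partial_i \partial_j f)^2] = 2 \sum_{S \ni i,j} (\hat{f}(S))^2. \]

Here, (a) follows from the basic properties of the Fourier spectrum of partial derivatives (see Sec. 2); (b) is
implied by second partial derivatives of a submodular function being always non-positive; and (c) follows
from $|\partial_i \partial_j f|$ having range $[0, 2]$ whenever $f$ has range $[0, 1]$. The final equality is Parseval's identity applied to $\partial_i \partial_j f$, using the same properties as in (a). □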
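As a sanity check, the relation between $|\hat{f}(\{i,j\})|$ and the squared coefficients containing $\{i,j\}$ can be verified by brute force on a small submodular function. The sketch below assumes the convention $\chi_S(x) = (-1)^{\sum_{i \in S} x_i}$ and the unnormalized derivative $\partial_i f(x) = f(x_{i \leftarrow 1}) - f(x_{i \leftarrow 0})$; the test function and all function names are illustrative choices made here, not taken from the paper.

```python
from itertools import combinations, product

def fourier_coefficient(f, n, S):
    """hat f(S) = E_x[f(x) * chi_S(x)] with chi_S(x) = (-1)^(sum of x_i for i in S)."""
    total = 0.0
    for x in product((0, 1), repeat=n):
        total += f(x) * (-1) ** sum(x[i] for i in S)
    return total / 2 ** n

def lemma_5_2_sides(f, n, i, j):
    """Return (|hat f({i,j})|, sum of hat f(S)^2 over all S containing both i and j)."""
    others = [v for v in range(n) if v not in (i, j)]
    tail = 0.0
    for size in range(len(others) + 1):
        for rest in combinations(others, size):
            tail += fourier_coefficient(f, n, (i, j) + rest) ** 2
    return abs(fourier_coefficient(f, n, (i, j))), tail

# Concave in the Hamming weight, hence submodular, with range [0, 1].
f = lambda x: min(sum(x), 2) / 2
lhs, tail = lemma_5_2_sides(f, 4, 0, 1)
print(lhs >= 2 * tail)
```

Both quantities are computed exactly (up to floating point), so the comparison does not depend on sampling.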

We can now easily complete the proof of Lemma 1.4.

Proof of Lemma 1.4. The proof relies on two simple observations. The first one is that Lemma 5.1 implies
that the set of indices $I_\gamma = \{ i \mid \exists S \ni i, |\hat{f}(S)| \ge \gamma \}$ satisfies the conditions of Lemma 1.4 for some
$\gamma = 2^{-O(1/\epsilon^2)}$.

Now if $i \in I_\gamma$ then either $|\hat{f}(\{i\})| \ge \gamma$ or there exists $j \ne i$ such that for some $S' \ni i, j$, $|\hat{f}(S')| \ge \gamma$. In the
latter case $\sum_{S \ni i,j} (\hat{f}(S))^2 \ge \gamma^2$. By Lemma 5.2 we can conclude that then $|\hat{f}(\{i,j\})| \ge 2\gamma^2$.

This suggests the following simple algorithm for finding $J$. Estimate degree 1 and 2 Fourier coefficients
of $f$ to accuracy $\gamma^2/2$ with confidence at least 5/6 using random examples (note that $\gamma < 1/2$ and hence
degree-1 coefficients are estimated with accuracy at least $\gamma/4$). Let $\tilde{f}(S)$ for $S \subseteq [n]$ of size 1 or 2 denote
the obtained estimates. We define

\[ J = \{ i \mid \exists j \in [n], |\tilde{f}(\{i,j\})| \ge 3\gamma^2/2 \}. \]

If the estimates are correct, then clearly, $I_\gamma \subseteq J$. At the same time, $J$ contains only indices which belong to a
Fourier coefficient of magnitude at least $\gamma^2$ and degree at most 2. By Parseval's identity, $|J| \le 2\|f\|_2^2/\gamma^4 =
2^{O(1/\epsilon^2)}$.

Finally, to bound the running time we observe that, by Chernoff bounds, $O(\log(n)/\gamma^4) = \log(n) \cdot
2^{O(1/\epsilon^2)}$ random examples are sufficient to obtain the desired estimates with confidence of 5/6. The estimation
of the coefficients can be done in $n^2 \log(n) \cdot 2^{O(1/\epsilon^2)}$ time. □
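The selection step in the proof above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, singletons are treated as the case $j = i$, and the "sample" used in the example is the full truth table so that the estimates are exact. With genuine random examples the number of samples would have to be chosen as in the proof.

```python
from itertools import combinations, product

def estimate_coefficient(examples, S):
    """Empirical estimate of hat f(S) from labeled examples (x, f(x))."""
    return sum(y * (-1) ** sum(x[i] for i in S) for x, y in examples) / len(examples)

def find_relevant_variables(examples, n, gamma):
    """Keep every index lying in a set of size 1 or 2 whose estimate is at least 3*gamma^2/2."""
    threshold = 1.5 * gamma ** 2
    relevant = set()
    candidates = [(i,) for i in range(n)] + list(combinations(range(n), 2))
    for S in candidates:
        if abs(estimate_coefficient(examples, S)) >= threshold:
            relevant.update(S)
    return relevant

# Illustration only: OR of x0 and x2 (a submodular function) on 4 variables.
f = lambda x: min(x[0] + x[2], 1)
examples = [(x, f(x)) for x in product((0, 1), repeat=4)]
print(sorted(find_relevant_variables(examples, 4, gamma=0.25)))
```

Only coefficients of degree 1 and 2 are examined, which is what keeps the running time quadratic in $n$.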

Now given a set $J$ that was output by the algorithm in Lemma 1.4 one can simply run the standard
low-degree algorithm of Linial et al. (1993) over variables with indices in $J$ to find a linear combination
of parities of degree $O(1/\epsilon^2)$, $\epsilon$-close to $f$. Note that we need to find coefficients of at most
$|J|^{O(1/\epsilon^2)} \le \min\{2^{O(1/\epsilon^4)}, n^{O(1/\epsilon^2)}\}$ parities. This immediately implies Theorem 1.5.



5.2 Agnostic learning with value queries 

Our next application is agnostic learning of submodular functions over the uniform distribution with value
queries. We give two versions of the agnostic learning algorithm, one based on $\ell_1$ and the other based on $\ell_2$
error. We note that, unlike in the PAC setting where small $\ell_2$ error also implied small $\ell_1$ error, these two
versions are incomparable and are also based on different algorithmic techniques. The agnostic learning
techniques we use are not new but we give attribute-efficient versions of those techniques using an
attribute-efficient agnostic learning of parities from (Feldman, 2007).



For the $\ell_2$ agnostic learning algorithm we need a known observation (e.g. (Gopalan et al., 2008)) that
the algorithm of Kushilevitz and Mansour (1993) can be used to obtain agnostic learning relative to $\ell_2$-norm
of all functions with spectral $\ell_1$ norm of $L$ in time $\mathrm{poly}(n, L, 1/\epsilon)$ (we include a proof in App. A). We also
observe that in order to learn agnostically decision trees of depth $d$ it is sufficient to restrict the attention to
significant Fourier coefficients of degree at most $d$. We can exploit this observation to improve the number
of value queries used for learning by using the attribute-efficient agnostic parity learning from (Feldman,
2007) in place of the KM algorithm. Specifically, we first prove the following attribute-efficient version of
agnostic learning of functions with low spectral $\ell_1$-norm (the proof appears in App. A).

Theorem 5.3. For $L > 0$, we define $\mathcal{C}_L^d$ as $\{p(x) \mid \|\hat{p}\|_1 \le L \text{ and } \mathrm{degree}(p) \le d\}$. There exists an
algorithm $A$ that given $\epsilon > 0$ and access to value queries for any real-valued $f : \{0,1\}^n \to [-1, 1]$, with
probability at least 2/3, outputs a function $h$, such that $\|f - h\|_2 \le \Delta + \epsilon$, where $\Delta = \min_{p \in \mathcal{C}_L^d}\{\|f - p\|_2\}$.
Further, $A$ runs in time $\mathrm{poly}(n, L, 1/\epsilon)$ and uses $\mathrm{poly}(d, \log(n), L, 1/\epsilon)$ value queries.

Together with Cor. 4.2 this implies Theorem 1.6.

Gopalan et al. (2008) give the $\ell_1$ version of agnostic learning for functions of low spectral $\ell_1$ norm.
Together with Cor. 4.2 this implies an $\ell_1$ agnostic learning algorithm for submodular functions using $\mathrm{poly}(n, 2^{1/\epsilon^2})$
time and queries. There is no known attribute-efficient version of the algorithm of Gopalan et al. (2008) and
their analysis is relatively involved. Instead we use our approximate representation by decision trees to
invoke a substantially simpler algorithm for agnostic learning of decision trees based on agnostic boosting
(Kalai and Kanade, 2009; Feldman, 2010). In this algorithm it is easy to use attribute-efficient agnostic
learning of parities (Feldman, 2007) (restated in Th. A.1) to reduce the query complexity of the algorithm.

Formally we give the following attribute-efficient algorithm for learning $[0, 1]$-valued decision trees.

Theorem 5.4. Let $\mathrm{DT}_{[0,1]}(r)$ denote the class of all $[0, 1]$-valued decision trees of rank-$r$ on $\{0,1\}^n$. There
exists an algorithm $A$ that given $\epsilon > 0$ and access to value queries of any $f : \{0,1\}^n \to \{0, 1\}$, with
probability at least 2/3, outputs a function $h : \{0,1\}^n \to [0, 1]$, such that $\|f - h\|_1 \le \Delta + \epsilon$, where
$\Delta = \min_{g \in \mathrm{DT}_{[0,1]}(r)}\{\|f - g\|_1\}$. Further, $A$ runs in time $\mathrm{poly}(n, 2^r, 1/\epsilon)$ and uses $\mathrm{poly}(\log n, 2^r, 1/\epsilon)$
value queries.

Combining Theorems 5.4 and 1.1 gives the following agnostic learning algorithm for submodular functions
(the proof is in App. A).

Theorem 5.5. Let $\mathcal{C}_s$ denote the class of all submodular functions from $\{0,1\}^n$ to $[0, 1]$. There exists an
algorithm $A$ that given $\epsilon > 0$ and access to value queries of any real-valued $f$, with probability at least 2/3,
outputs a function $h$, such that $\|f - h\|_1 \le \Delta + \epsilon$, where $\Delta = \min_{g \in \mathcal{C}_s}\{\|f - g\|_1\}$. Further, $A$ runs in time
$\mathrm{poly}(n, 2^{1/\epsilon^2})$ and uses $\mathrm{poly}(\log n, 2^{1/\epsilon^2})$ value queries.

6 Lower Bounds 

6.1 Computational Lower Bounds for Agnostic Learning of Submodular Functions 

In this section we show that the existence of an algorithm for agnostically learning even monotone and
symmetric¹ submodular functions (i.e. concave functions of $\sum_{i \in [n]} x_i$) to an accuracy of any $\epsilon > 0$ in time
$n^{o(1/\epsilon^{2/3})}$ would yield a faster algorithm for learning sparse parities with noise (SLPN from now on), which is
a well known and notoriously hard problem in computational learning theory.

We begin by stating the problems of Learning Parities with Noise (LPN) and its variant, learning sparse
parities with noise (SLPN). We say that random examples of a function $f$ have noise of rate $\eta$ if the label of
a random example equals $f(x)$ with probability $1 - \eta$ and $-f(x)$ with probability $\eta$.

¹ In this context, we call a function $f : \{0,1\}^n \to \mathbb{R}$ symmetric if $f(x)$ depends only on $\sum_i x_i$. This is different from the notion
of a symmetric set function, which usually means the condition $f(S) = f(\bar{S})$.

Problem 6.1 (Learning Parities with Noise). For $\eta \in (0, 1/2)$, the problem of learning parities with noise
$\eta$ is the problem of finding (with probability at least 2/3) the set $S \subseteq [n]$, given access to random examples
with noise of rate $\eta$ of parity function $\chi_S$. For $k \le n$ the learning of $k$-sparse parities with noise $\eta$ is the
same problem with an additional condition that $|S| \le k$.



The best known algorithm for the LPN problem with constant noise rate is by Blum et al. (2003) and
runs in time $2^{O(n/\log n)}$. The fastest known algorithm for learning $k$-sparse parities with noise $\eta$ is a recent
breakthrough result of Valiant (2012) which runs in time $O(n^{0.8k} \mathrm{poly}(\frac{1}{1 - 2\eta}))$.

Kalai et al. (2008) and Feldman (2012) prove hardness of agnostic learning of majorities and conjunctions,
respectively, based on correlation of concepts in these classes with parities. In both works it is implicit
that if for every set $S \subseteq [n]$, a concept class $C$ contains a function $f_S$ that has significant correlation with
$\chi_S$ (or $\hat{f_S}(S)$) then learning of parities with noise can be reduced to agnostic learning of $C$. We now present
this reduction in a general form.

Lemma 6.2. Let $C$ be a class of functions mapping $\{0,1\}^n$ into $[-1, 1]$. Suppose there exist $\gamma > 0$ and
$k \in \mathbb{N}$ such that for every $S \subseteq [n]$, $|S| \le k$, there exists a function $f_S \in C$ such that $|\hat{f_S}(S)| \ge \gamma$. If there
exists an algorithm $A$ that learns the class $C$ agnostically to accuracy $\epsilon$ in time $T(n, \frac{1}{\epsilon})$ then there exists an
algorithm $A'$ that learns $k$-sparse parities with noise $\eta < 1/2$ in time $\mathrm{poly}(n, \frac{1}{(1 - 2\eta)\gamma}) + 2T(n, \frac{2}{(1 - 2\eta)\gamma})$.



Proof. Let $\chi_S$ be the target parity with $|S| \le k$. We run algorithm $A$ with $\epsilon = (1 - 2\eta)\gamma/2$ on the noisy
examples and let $h$ be the hypothesis it outputs. We also run algorithm $A$ with $\epsilon = (1 - 2\eta)\gamma/2$ on the
negated noisy examples and let $h'$ be the hypothesis it outputs.

Now let $f_S \in C$ be the function such that $|\hat{f_S}(S)| \ge \gamma$. Assume without loss of generality that $\hat{f_S}(S) \ge \gamma$
(otherwise we will use the same argument on the negation of $f_S$). Let $N^\eta$ denote the distribution over noisy
examples.

For any function $f : \{0,1\}^n \to [-1, 1]$,

\[
\begin{aligned}
\mathbf{E}_{(x,y) \sim N^\eta}[|f(x) - y|] &= (1 - \eta)\mathbf{E}_{x \sim U}[|f(x) - \chi_S(x)|] + \eta\,\mathbf{E}_{x \sim U}[|f(x) + \chi_S(x)|] \\
&= (1 - \eta)\mathbf{E}_{x \sim U}[\chi_S(x)(\chi_S(x) - f(x))] + \eta\,\mathbf{E}_{x \sim U}[\chi_S(x)(\chi_S(x) + f(x))] \\
&= 1 - (1 - 2\eta)\hat{f}(S). \qquad (5)
\end{aligned}
\]

This implies that

\[ \mathbf{E}_{(x,y) \sim N^\eta}[|f_S(x) - y|] = 1 - (1 - 2\eta)\hat{f_S}(S) \le 1 - (1 - 2\eta)\gamma. \]

By the agnostic property of $A$ with $\epsilon = (1 - 2\eta)\gamma/2$, the returned hypothesis $h$ must satisfy

\[ \mathbf{E}_{(x,y) \sim N^\eta}[|h(x) - y|] \le 1 - (1 - 2\eta)\gamma + (1 - 2\eta)\gamma/2 = 1 - (1 - 2\eta)\gamma/2. \]

By equation (5) this implies that $\hat{h}(S) \ge \gamma/2$.

We can now use the algorithm of Goldreich and Levin (1989) (or a similar one) to find all sets
with a Fourier coefficient of at least $\gamma/4$ (with accuracy of $\gamma/8$). This can be done in time polynomial in
$n$ and $1/\gamma$ and will give a set of coefficients of size at most $O(1/\gamma^2)$ which contains $S$. By testing each
coefficient in this set on $O((1 - 2\eta)^{-2} \log(1/\gamma))$ random examples and choosing the one with the best
agreement we find $S$. □
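The identity relating the $\ell_1$ error on noisy parity examples to the correlation $\hat{f}(S)$, namely $\mathbf{E}[|f(x) - y|] = 1 - (1 - 2\eta)\hat{f}(S)$ for $f$ with values in $[-1, 1]$, can be confirmed by exact enumeration. The sketch below is illustrative: the test function, the set `S`, the noise rate and the function names are arbitrary choices made here.

```python
from itertools import product

def noisy_l1_error(f, n, S, eta):
    """Exact E|f(x) - y| when y = chi_S(x) w.p. 1 - eta and y = -chi_S(x) w.p. eta."""
    total = 0.0
    for x in product((0, 1), repeat=n):
        chi = (-1) ** sum(x[i] for i in S)
        total += (1 - eta) * abs(f(x) - chi) + eta * abs(f(x) + chi)
    return total / 2 ** n

def correlation(f, n, S):
    """hat f(S) = E_x[f(x) * chi_S(x)]."""
    return sum(f(x) * (-1) ** sum(x[i] for i in S)
               for x in product((0, 1), repeat=n)) / 2 ** n

# Arbitrary function with values in [-1, 1].
f = lambda x: 0.5 * x[0] - 0.25 * x[1] * x[2]
S, eta = (0, 2), 0.2
print(noisy_l1_error(f, 3, S, eta), 1 - (1 - 2 * eta) * correlation(f, 3, S))
```

The two printed numbers agree because $|f - \chi_S| = 1 - f\chi_S$ and $|f + \chi_S| = 1 + f\chi_S$ whenever $|f| \le 1$ and $\chi_S \in \{-1, 1\}$.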

We will now show that there exist monotone symmetric submodular functions that have high correlation
with the parity functions (the proof is in Appendix C).

Lemma 6.3 (Correlation of Monotone Submodular Functions with Parities). Let $S \subseteq [n]$ such that $|S| = s$
for some $s \in [n]$. Then, there exists a monotone symmetric submodular function $H_S : \{0,1\}^n \to [0, 1]$ such
that $H_S$ depends only on coordinates in $S$ and $|\langle \chi_S, H_S \rangle| = \Omega(s^{-3/2})$.

Combining this result with Lemma 6.2 we now obtain the following reduction of SLPN to agnostically
learning monotone submodular functions:

Theorem 6.4 (Theorem 1.8 restated). If there exists an algorithm that agnostically learns all monotone
submodular functions with range $[0, 1]$ to $\ell_1$ error of $\epsilon > 0$ in time $T(n, 1/\epsilon)$ then there exists an algorithm
that learns $(\epsilon^{-2/3})$-sparse parities with noise of rate $\eta < 1/2$ in time $\mathrm{poly}(n, 1/(\epsilon(1 - 2\eta))) + 2T(n, c/(\epsilon(1 - 2\eta)))$
for some fixed constant $c$.

Proof. Consider all the monotone submodular functions $H_S$ for every $S \subseteq [n]$, $|S| \le k = \epsilon^{-2/3}$. Then,
$|\langle \chi_S, H_S \rangle| = \Omega(k^{-3/2}) = \Omega(\epsilon)$ by Lemma 6.3. Thus, using $\gamma = \Omega(\epsilon)$ in Lemma 6.2 we obtain the
claim. □

6.2 Information-Theoretic Lower Bound for PAC-learning Submodular Functions 

In this section we show that any algorithm that PAC-learns monotone submodular functions to accuracy $\epsilon$
must use $2^{\Omega(\epsilon^{-2/3})}$ examples. The idea is to show that the problem of learning the class of all boolean functions
on $k$ variables to any constant accuracy can be reduced to the problem of learning submodular functions on
$2t = k + \lceil \log k \rceil + O(1)$ variables to accuracy $\Theta(\frac{1}{t^{3/2}})$. Any algorithm that learns the class of all boolean
functions on $k$ variables to accuracy 1/4 requires at least $\Omega(2^k)$ bits of information. In particular, at least
that many random examples or value queries are necessary.

Before we go on to present the reduction, we need to make a quick note regarding a slight abuse
of notation: in the lemma below, we will encounter uniform distributions on hypercubes of two different
dimensions. We will, however, still represent uniform distributions on either of them by $U$ (with the meaning
clear from the context).

Lemma 6.5. Let $f : \{0,1\}^k \to \{0, 1\}$ be any boolean function. Let $t > 0$ be such that $\binom{2t}{t} \ge 2^k > \binom{2t-2}{t-1}$
(thus $4 \cdot 2^k > \binom{2t}{t} \ge 2^k$). There exists a monotone submodular function $h : \{0,1\}^{2t} \to [0, 1]$ such that:

1. $h$ can be computed at any point $x \in \{0,1\}^{2t}$ in at most a single query to $f$ and in time $O(t)$.

2. Let a = „■]{ = 0{1). Given any function g : {0, 1}^* — )■ M that approximates h, that is, 
^x'^u[\h{x) — g{x)\] < a ■ r,fl/2 > there exists a boolean function f : {0, 1} — )• {0, 1} such that 

^x'^u[\f{^) — f{^)\] ^ e '^'^d f can be computed at any point x G {0, 1}'^, with a single query to g 
and in time 0{t). 

Proof. We first give a construction for the function $h$. It will be convenient first to define another function
$\bar{h} : \{0,1\}^{2t} \to [0, 1]$ and then modify it to obtain $h$. Recall that for any $x$ and $S \subseteq [2t]$, $w_S(x) =
\sum_{i \in S} \frac{1}{2}(x_i + 1)$. The function $\bar{h}$ would be the same as the function $H_S$ defined in the proof of Lemma 6.3:

\[
\bar{h}(x) = \begin{cases} w_{[2t]}(x)/t & w_{[2t]}(x) \le t \\ 1 & w_{[2t]}(x) > t \end{cases}
\]

We will now define $h$ using $\bar{h}$ and $f$. The key idea is that even if we lower the value of $\bar{h}$ at any $x$ with
$w_{[2t]}(x) = t$ by $\frac{1}{2t}$, the resulting function remains submodular. Thus, we embed the boolean function $f$ by
modifying the values of $\bar{h}$ at only the points in the middle layer ($w_{[2t]}(x) = t$).

Let $s = \binom{2t}{t}$. Let $M_{2t} = \{x \in \{0,1\}^{2t} \mid w_{[2t]}(x) = t\}$ and $M_k = \{0,1\}^k$, so that $s \ge 2^k$. Let
$\beta : M_k \to M_{2t}$ be an injective map of $M_k$ into $M_{2t}$ such that both $\beta$ and $\beta^{-1}$ (whenever it exists) can be
computed in time $O(t)$ at any given point. Such a map exists, as can be seen by imposing lexicographic
ordering on $M_{2t}$ and $M_k$ and defining $\beta(x)$ for $x \in M_k$ to be the element in $M_{2t}$ with the same position in
the ordering as that of $x$. For each $x \in \{0,1\}^{2t}$, let $h$ be defined by:

\[
h(x) = \begin{cases}
\bar{h}(x) & w_{[2t]}(x) \ne t \\
1 - \frac{1}{2t} & w_{[2t]}(x) = t,\ \beta^{-1}(x) \text{ exists and } f(\beta^{-1}(x)) = 0 \\
1 & w_{[2t]}(x) = t,\ \beta^{-1}(x) \text{ exists and } f(\beta^{-1}(x)) = 1 \\
1 & \text{otherwise}
\end{cases}
\]

Notice that given any $x \in \{0,1\}^{2t}$ the value of $h(x)$ can be computed by a single query to $f$. Further,
observe that $\bar{h}$ is monotone and $h$ is obtained by modifying $\bar{h}$ only on points in $M_{2t}$ and by at most $\frac{1}{2t}$, which
ensures that for any $x \le y$ such that $w_{[2t]}(x) < w_{[2t]}(y)$, $h(x) \le h(y)$. Moreover, $M_{2t}$ forms an antichain in
the partial order on $\{0,1\}^{2t}$ and thus no two points in $M_{2t}$ are comparable. This proves that $h$ is monotone.
Suppose, now that g : {0, 1}^* — > M is such that 'Eix^u\\h{x) — g{x)W < a ■ ^^■ 

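Define $g_b : \{0,1\}^{2t} \to \{0, 1\}$ so that

\[ \forall x \in \{0,1\}^{2t}, \quad g_b(x) = \mathrm{sign}\left(g(x) - \left(1 - \frac{1}{4t}\right)\right). \]

Finally, let $\tilde{f} : \{0,1\}^k \to \{0, 1\}$ be such that for every $x \in \{0,1\}^k$, $\tilde{f}(x) = g_b(\beta(x))$.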
Now, B,^u[\f{x) - f{x)\] = 2PT^^u[f{x) + /(x)]. For any x G {0, l}^ 

/(x)//(x)4^|<7(/3(x))-M/3(x))|>l 

Using that Prj,~w[/3~^(y) exists ] = -y=, we have: 

E,^w[|5(y) - h{y)\\ > i Pi;\/3-\y) exists and fiP~\y) / firHv)] 

Using Ej;^w[|5r(y) - h{y)\] < a ■ ^7^^, we have: E^^u[\f{x) - f{x)\] < e. 

Finally, we show that $h$ is submodular for any boolean function $f$. It will be convenient to switch
notation and look at input $x$ as the indicator function of the set $S_x = \{i \mid x_i = 1\}$. We will verify that for
each $S \subseteq [2t]$ and $i, j \notin S$,

\[ h(S \cup \{i\}) - h(S) \ge h(S \cup \{i,j\}) - h(S \cup \{j\}). \qquad (6) \]

Notice that $\bar{h}$ is submodular, and $h = \bar{h}$ on every $x$ such that $w_{[2t]}(x) \ne t$. Thus, we only need to check
equation (6) for $S, i, j$ such that $|S| \in \{t - 2, t - 1, t\}$. We analyze these 3 cases separately:

1. $|S| = t - 1$: Notice that $h(S) = \bar{h}(S) = 1 - \frac{1}{t}$ and $h(S \cup \{i,j\}) = \bar{h}(S \cup \{i,j\}) = 1$. Also
observe that for any $f$, $h(S \cup \{i\})$ and $h(S \cup \{j\})$ are at least $1 - \frac{1}{2t}$. Thus, $h(S \cup \{i\}) + h(S \cup \{j\}) \ge
2 - \frac{1}{t} = h(S) + h(S \cup \{i,j\})$.

2. $|S| = t - 2$: In this case, $h(S) = 1 - \frac{2}{t}$ and $h(S \cup \{i\}) = h(S \cup \{j\}) = 1 - \frac{1}{t}$. The
maximum value, for any $f$, of $h(S \cup \{i,j\})$ is 1. Thus,

\[ h(S) + h(S \cup \{i,j\}) \le 2 - \frac{2}{t} = h(S \cup \{i\}) + h(S \cup \{j\}). \]

3. $|S| = t$: Here, $h(S \cup \{i\}) = h(S \cup \{j\}) = h(S \cup \{i,j\}) = 1$. The maximum value of $h(S)$ for any
$f$ is 1. Thus,

\[ h(S) + h(S \cup \{i,j\}) \le 2 = h(S \cup \{i\}) + h(S \cup \{j\}). \]

This completes the proof that $h$ is submodular. □
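For small parameters, the monotonicity and submodularity of $h$ can also be confirmed exhaustively. The sketch below builds $h$ from a truth table of $f$, taking $\beta$ to be the lexicographic-position map, and checks inequality (6) for every Boolean $f$ on $k = 2$ variables with $t = 2$. The function names and the table-based representation of $f$ are assumptions made for this illustration.

```python
from itertools import product

def build_h(f_table, t):
    """Construct h on {0,1}^(2t) from a Boolean function given as a table on {0,1}^k."""
    n = 2 * t
    middle = sorted(x for x in product((0, 1), repeat=n) if sum(x) == t)
    inputs = sorted(f_table)
    assert len(inputs) <= len(middle)  # needs binom(2t, t) >= 2^k
    beta_inverse = dict(zip(middle, inputs))  # same position in lexicographic order

    def h(x):
        weight = sum(x)
        if weight < t:
            return weight / t
        if weight > t:
            return 1.0
        y = beta_inverse.get(x)
        if y is not None and f_table[y] == 0:
            return 1 - 1 / (2 * t)  # lowered middle-layer point encodes f(y) = 0
        return 1.0
    return h

def is_monotone_submodular(h, n):
    """Check monotonicity and inequality (6) at every point and every pair i != j."""
    eps = 1e-12
    for x in product((0, 1), repeat=n):
        free = [i for i in range(n) if x[i] == 0]
        up = lambda *idx: tuple(1 if v in idx else x[v] for v in range(n))
        for i in free:
            if h(up(i)) < h(x) - eps:
                return False
            for j in free:
                if i != j and h(up(i)) - h(x) < h(up(i, j)) - h(up(j)) - eps:
                    return False
    return True

t, k = 2, 2
ok = all(
    is_monotone_submodular(build_h(dict(zip(product((0, 1), repeat=k), bits)), t), 2 * t)
    for bits in product((0, 1), repeat=2 ** k)
)
print(ok)
```

The check enumerates all $2^{2^k}$ Boolean functions, so it is only feasible for very small $k$.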

We now have the following lower bound on the running time of any learning algorithm (even with value
queries) that learns monotone submodular functions.

Theorem 6.6 (Theorem 1.7 restated). Any algorithm that PAC learns all monotone submodular functions
with range $[0, 1]$ to $\ell_1$ error of $\epsilon > 0$ requires $2^{\Omega(\epsilon^{-2/3})}$ value queries to $f$.

Proof. We borrow notation from the statement of Lemma 6.5 here. Given an algorithm that PAC learns
monotone submodular functions on $2t$ variables, we describe how one can obtain a learning algorithm for
all boolean functions on $k$ variables with accuracy 1/4. Given access to a boolean function $f : \{0,1\}^k \to
\{0, 1\}$, we can translate it into access to a submodular function $h$ on $2t$ variables with an overhead of
at most $O(t) = O(k)$ time using Lemma 6.5. Using the PAC learning algorithm, we can obtain a function
$g : \{0,1\}^{2t} \to \mathbb{R}$ that approximates $h$ within the error required by the second part of Lemma 6.5 for target error 1/4, and Lemma 6.5 shows how
to obtain $\tilde{f}$ from $g$ with an overhead of at most $O(t) = O(k)$ time such that $\tilde{f}$ approximates $f$ within
1/4. Choose $k = \lceil \epsilon^{-2/3} \rceil$ and $t$ as described in the statement of Lemma 6.5. Now, using any algorithm
that learns monotone submodular functions to an accuracy of $\epsilon > 0$ we obtain an algorithm that learns all
boolean functions on $k = \lceil \epsilon^{-2/3} \rceil$ variables to accuracy 1/4. □
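The relation between $k$ and the number $2t$ of variables of the embedded function can be computed directly from the condition $\binom{2t}{t} \ge 2^k > \binom{2t-2}{t-1}$ in Lemma 6.5. The helper below is a small illustrative sketch; its name is hypothetical.

```python
import math

def middle_layer_parameter(k):
    """Smallest t with binom(2t, t) >= 2^k, so that {0,1}^k embeds into the middle layer."""
    t = 1
    while math.comb(2 * t, t) < 2 ** k:
        t += 1
    return t

for k in (5, 10, 20):
    t = middle_layer_parameter(k)
    # Columns: k, number of variables 2t, and the overhead 2t - k.
    print(k, 2 * t, 2 * t - k)
```

Because $t$ is chosen minimal, the second inequality $2^k > \binom{2t-2}{t-1}$ holds automatically for $t \ge 2$, and the overhead $2t - k$ grows only logarithmically in $k$.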
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A Attribute-efficient Agnostic Learning 

In this section we give attribute-efficient versions of two agnostic learning algorithms: (1) the $\ell_2$-error
agnostic learning of functions with low spectral $\ell_1$-norm and (2) $\ell_1$-error agnostic learning of (real-valued)
decision trees. The algorithms are obtained using a simple combination of existing techniques with
attribute-efficient weak agnostic parity learning from (Feldman, 2007). For the first algorithm we are not aware of
published details of the analysis even without the attribute-efficiency.

We first state the attribute-efficient weak agnostic parity learning from (Feldman, 2007).

Theorem A.1. There exists an algorithm WP, that given an integer $d$, $\theta > 0$ and $\delta \in (0, 1]$, access to value
queries of any $f : \{0,1\}^n \to [-1, 1]$ such that $|\hat{f}(S)| \ge \theta$ for some $S$, $|S| \le d$, with probability at least
1 — (5, returns S', such that \f{S')\ > 6/2 and \S'\ < d. WP(d, 6, 6) runs in O {nd^O"'^ log (1/5)) time and 
asks O (d^ log n • 6~'^ log (1/(5)) value queries. 

Using WP we can find a set $\mathcal{S}$ of subsets of $[n]$ such that (1) if $S \in \mathcal{S}$ then $|\hat{f}(S)| \ge \theta/2$ and $|S| \le d$; (2) if
$|\hat{f}(S)| \ge \theta$ and $|S| \le d$ then $S \in \mathcal{S}$. The first property implies that $|\mathcal{S}| \le 4/\theta^2$. With probability $1 - \delta$, $\mathcal{S}$ can
be found in time polynomial in $1/\theta^2$ and the running time of WP$(d, \theta, \delta\theta^2/4)$. With probability at least $1 - \delta$,
each coefficient in $\mathcal{S}$ can be estimated to within $\theta/4$ using a random sample of size $O(\log(1/\delta)/\theta^2)$. This
gives the following low-degree version of the Kushilevitz-Mansour algorithm (Kushilevitz and Mansour,
1993).

Theorem A.2. There exists an algorithm AEFT, that given an integer $d$, $\theta > 0$ and $\delta \in (0, 1]$, access to
value queries of any $f : \{0,1\}^n \to [-1, 1]$, with probability at least $1 - \delta$, returns a function $h$ represented
by the set of its non-zero Fourier coefficients such that

1. $\mathrm{degree}(h) \le d$;

2. for all $S \subseteq [n]$ such that $|\hat{f}(S)| \ge \theta$ and $|S| \le d$, $\hat{h}(S) \ne 0$;

3. for all $S \subseteq [n]$, if $|\hat{f}(S)| < \theta/2$ then $\hat{h}(S) = 0$;

4. if $\hat{h}(S) \ne 0$ then $|\hat{f}(S) - \hat{h}(S)| \le \theta/4$.

AEFT((i, 6, 6) runs in O {nd^6^'^ log (1/(5)) time and asks O {d^ log^ n ■ 6^"^ log {1/6)) value queries. 

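We now show that for $\theta = \epsilon^2/(2L)$, AEFT agnostically learns the class

\[ \mathcal{C}_L^d = \{p(x) \mid \|\hat{p}\|_1 \le L \text{ and } \mathrm{degree}(p) \le d\}. \]

Lemma A.3. For $L > 0$, $\epsilon \in (0, 1)$ and integer $d$, let $f : \{0,1\}^n \to [-1, 1]$ and $h : \{0,1\}^n \to \mathbb{R}$ be functions such
that for $\theta = \epsilon^2/(2L)$,

1. $\mathrm{degree}(h) \le d$;

2. for all $S \subseteq [n]$ such that $|\hat{f}(S)| \ge \theta$ and $|S| \le d$, $\hat{h}(S) \ne 0$;

3. for all $S \subseteq [n]$, if $|\hat{f}(S)| < \theta/2$ then $\hat{h}(S) = 0$;

4. if $\hat{h}(S) \ne 0$ then $|\hat{f}(S) - \hat{h}(S)| \le \theta/4$.

Then for any $g \in \mathcal{C}_L^d$, $\|f - h\|_2 \le \|f - g\|_2 + \epsilon$.

Proof. We show that for every $S \subseteq [n]$,

\[ (\hat{f}(S) - \hat{h}(S))^2 \le (\hat{f}(S) - \hat{g}(S))^2 + 2\theta \cdot |\hat{g}(S)| = (\hat{f}(S) - \hat{g}(S))^2 + \frac{\epsilon^2 |\hat{g}(S)|}{L}. \qquad (7) \]

First note that this would immediately imply that

\[
\begin{aligned}
\|f - h\|_2^2 = \sum_{S \subseteq [n]} (\hat{f}(S) - \hat{h}(S))^2 &\le \sum_{S \subseteq [n]} \left[(\hat{f}(S) - \hat{g}(S))^2 + \frac{\epsilon^2 |\hat{g}(S)|}{L}\right] = \|f - g\|_2^2 + \frac{\epsilon^2}{L} \sum_{S \subseteq [n]} |\hat{g}(S)| \\
&\le \|f - g\|_2^2 + \epsilon^2 \le (\|f - g\|_2 + \epsilon)^2.
\end{aligned}
\]

To prove equation (7) we consider two cases. If $\hat{h}(S) = 0$, then either $|S| > d$ or $|\hat{f}(S)| < \theta$. In the former
case $\hat{g}(S) = 0$ and therefore equation (7) holds. In the latter case:

\[ (\hat{f}(S) - \hat{h}(S))^2 = (\hat{f}(S))^2 \le (\hat{f}(S) - \hat{g}(S))^2 + 2|\hat{f}(S)| \cdot |\hat{g}(S)| \le (\hat{f}(S) - \hat{g}(S))^2 + 2\theta \cdot |\hat{g}(S)|. \]

In the second case (when $\hat{h}(S) \ne 0$), we get that $|\hat{f}(S)| \ge \theta/2$ and $|\hat{f}(S) - \hat{h}(S)| \le \theta/4$. Therefore,
either $|\hat{g}(S)| \le |\hat{f}(S)|/2$ and then $(\hat{f}(S) - \hat{g}(S))^2 \ge (\hat{f}(S))^2/4 \ge \theta^2/16$, or $|\hat{g}(S)| > |\hat{f}(S)|/2 \ge \theta/4$ and
then $2\theta \cdot |\hat{g}(S)| > \theta^2/2$. In both cases,

\[ (\hat{f}(S) - \hat{h}(S))^2 \le \frac{\theta^2}{16} \le (\hat{f}(S) - \hat{g}(S))^2 + 2\theta \cdot |\hat{g}(S)|. \]

□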



Theorem 5.3 is a direct corollary of Theorem A.2 and Lemma A.3.

The proof of Theorem 5.5 relies on agnostic learning of decision trees. We first give an attribute-efficient
algorithm for this problem.

Theorem A.4. Let $\mathrm{DT}(r)$ denote the class of all Boolean decision trees of rank-$r$ on $\{0,1\}^n$. There exists
an algorithm $A$ that given $\epsilon > 0$ and access to value queries of any $f : \{0,1\}^n \to \{0, 1\}$, with probability
at least 2/3, outputs a function $h : \{0,1\}^n \to \{0, 1\}$, such that $\Pr_U[f \ne h] \le \Delta + \epsilon$, where $\Delta =
\min_{g \in \mathrm{DT}(r)}\{\Pr_U[f \ne g]\}$. Further, $A$ runs in time $\mathrm{poly}(n, 2^r, 1/\epsilon)$ and uses $\mathrm{poly}(\log n, 2^r, 1/\epsilon)$ value
queries.

Proof. We first use Theorem 4.1 to reduce the problem of agnostic learning of decision trees of rank at most
$r$ to the problem of agnostic learning of decision trees of depth $\frac{5}{2}(r + \log(2/\epsilon))$ with error parameter $\epsilon/2$.
In (Feldman, 2010) and (Kalai and Kanade, 2009) it is shown that a distribution-specific agnostic boosting
algorithm reduces the problem of agnostic learning decision trees of size $s$ with error $\epsilon' = \epsilon/2$ to that of
weak agnostic learning of decision trees invoked $O(s^2/\epsilon'^2)$ times. It was also shown in those works that
agnostic learning of parities with error of $\epsilon'/(2s)$ gives the necessary weak agnostic learning of decision trees.
Further, as can be easily seen from the proof, for decision trees of depth $\le d$ it is sufficient to agnostically
learn parities of degree $\le d$. In our case the size of the decision tree is $\le 2^d = (2^{r+1}/\epsilon)^{5/2}$. We can use the
WP algorithm with error parameter $\epsilon'/(2s) \ge \epsilon^{7/2}/2^{\frac{5}{2}r + 5}$ and degree $d$, to obtain weak agnostic learning
of decision trees in time $\mathrm{poly}(n, 2^r, 1/\epsilon)$ and using $\mathrm{poly}(\log n, 2^r, 1/\epsilon)$ value queries. This implies that
agnostic learning of decision trees can be achieved in time $\mathrm{poly}(n, 2^r, 1/\epsilon)$ and using $\mathrm{poly}(\log n, 2^r, 1/\epsilon)$
value queries. □

From here we can easily obtain an algorithm for agnostic learning of rank-$r$ decision trees with
real-valued constants from $[0, 1]$. We obtain it by using a simple argument (see (Feldman and Kothari, 2013)
for a simple proof) that reduces learning of a real-valued function $g$ to learning of boolean functions of the
form $g_\theta(x) = $ "$g(x) \ge \theta$" (note that every $g : \{0,1\}^n \to [0, 1]$ is $\epsilon$-close (in $\ell_1$ distance) to $g'(x) =
\sum_{i \in [\lfloor 1/\epsilon \rfloor]} \epsilon \cdot g_{i\epsilon}(x)$). We now observe that if $g$ can be represented as a decision tree of rank $r$, then for every $\theta$,
$g_\theta$ can be represented as a decision tree of rank $r$. Therefore this reduction implies that agnostic learning of
Boolean rank-$r$ decision trees gives agnostic learning of $[0, 1]$-valued rank-$r$ decision trees. The reduction
runs the Boolean version $2/\epsilon$ times with accuracy $\epsilon/2$ and yields the proof of Theorem 5.4.
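The threshold decomposition used in this reduction can be illustrated pointwise: a value $g(x) \in [0, 1]$ is approximated by $\epsilon$ times the number of levels $i\epsilon$, $1 \le i \le \lfloor 1/\epsilon \rfloor$, that it reaches. The sketch below evaluates this for a single value; the function name is hypothetical and the code is only meant to show why the approximation error is at most $\epsilon$.

```python
def threshold_approximation(value, eps):
    """g'(x) = sum over i = 1..floor(1/eps) of eps * [g(x) >= i*eps], for one value g(x)."""
    levels = int(1 / eps)
    return eps * sum(1 for i in range(1, levels + 1) if value >= i * eps)

for value in (0.0, 0.05, 0.33, 0.5, 0.99, 1.0):
    print(value, threshold_approximation(value, 0.1))
```

Each indicator "$g(x) \ge i\epsilon$" is a Boolean function of the same rank as $g$, which is what lets the Boolean learner be reused.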

B Learning Pseudo-Boolean Submodular Functions 



In a recent work, Raskhodnikova and Yaroslavtsev (2013) consider learning and testing of submodular functions 
taking values in the range $\{0, 1, \ldots, k\}$. The error of a hypothesis in their framework is the probability 
that the hypothesis disagrees with the unknown function (hence it is referred to as pseudo-Boolean). For 
this restriction they give a poly$(n) \cdot k^{O(k \log k/\epsilon)}$-time PAC learning algorithm using value queries. 

As they observed, error $\epsilon$ in their model can also be obtained by learning the function scaled to the range 
$\{0, 1/k, \ldots, 1\}$ with $\ell_1$ error of $\epsilon/k$ (since for two functions with that range $\mathbf{E}[|f - h|] \leq \epsilon/k$ implies that 
$\Pr[f \neq h] \leq \epsilon$). Therefore our structural results can also be interpreted in their framework directly. We now 
show that even stronger results are implied by our technique. 
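The conversion from $\ell_1$ error to disagreement error rests on the fact that two distinct values on the grid $\{0, 1/k, \ldots, 1\}$ differ by at least $1/k$, so $\Pr[f \neq h] \leq k \cdot \mathbf{E}[|f - h|]$. A minimal sketch of this inequality follows; the functions `f` and `h` are hypothetical examples chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

def l1_error(f, h, n):
    """E|f - h| under the uniform distribution on {0,1}^n."""
    pts = list(product((0, 1), repeat=n))
    return sum(abs(f(x) - h(x)) for x in pts) / Fraction(len(pts))

def disagreement(f, h, n):
    """Pr[f != h] under the uniform distribution on {0,1}^n."""
    pts = list(product((0, 1), repeat=n))
    return Fraction(sum(1 for x in pts if f(x) != h(x)), len(pts))

# Hypothetical {0, 1/k, ..., 1}-valued functions on {0,1}^4 with k = 3.
n, k = 4, 3
f = lambda x: Fraction(min(sum(x), k), k)
h = lambda x: Fraction(min(sum(x), k - 1), k)
```

In this example the two functions differ by exactly $1/k$ wherever they disagree, so the inequality holds with equality.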

The first observation is that a $\frac{1}{k+1/3}$-Lipschitz function with the range $\{0, 1/k, \ldots, 1\}$ is a constant. 
Therefore Theorem 3.1 implies an exact representation of submodular functions with range $\{0, 1, \ldots, k\}$ 
by decision trees of rank $\leq \lfloor 2k + 2/3 \rfloor = 2k$ with constants from $\{0, 1/k, \ldots, 1\}$ in the leaves. We 
note that this representation is incomparable to the $2k$-DNF representation which is the basis of results in 
(Raskhodnikova and Yaroslavtsev, 2013). 
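The arithmetic behind this observation can be spelled out as follows. This is a sketch under two assumptions that are not restated in this appendix: that $\alpha$-Lipschitz means $|f(x) - f(x \oplus e_i)| \leq \alpha$ for every $x$ and coordinate $i$, and that Theorem 3.1 gives rank at most $\lfloor 2/\alpha \rfloor$ with $\alpha$-Lipschitz functions in the leaves (the reading suggested by the bound $\lfloor 2k + 2/3 \rfloor$).

```latex
\alpha = \frac{1}{k + 1/3} < \frac{1}{k}
\;\Longrightarrow\;
|f(x) - f(x \oplus e_i)| \le \alpha < \frac{1}{k}
\;\Longrightarrow\;
f(x) = f(x \oplus e_i) \quad \text{for all } x, i,
\qquad
\left\lfloor \frac{2}{\alpha} \right\rfloor
  = \left\lfloor 2k + \frac{2}{3} \right\rfloor = 2k .
```

The middle implication uses that distinct values in $\{0, 1/k, \ldots, 1\}$ differ by at least $1/k$; constancy then follows because any two points of the hypercube are connected by single-coordinate flips.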



We can also directly combine Theorems 3.1 and 4.1 to obtain the following analogue of Corollary 1.3. 

Theorem B.1. Let $f : \{0,1\}^n \to \{0, 1, \ldots, k\}$ be a submodular function and $\epsilon > 0$. There exists a 
$\{0, 1, \ldots, k\}$-valued decision tree $T$ of depth $d = 5(k + \log(1/\epsilon))$ such that $\Pr_U[T \neq f] \leq \epsilon$. In particular, 
$T$ depends on at most $2^{5k}/\epsilon^5$ variables and $\|\hat{T}\|_1 \leq 2k \cdot 2^{5k}/\epsilon^5$. 



These results improve on the spectral norm bound of $k^{O(k \log k/\epsilon)}$ from (Raskhodnikova and Yaroslavtsev, 
2013). In a follow-up (independent of this paper) work Blais et al. (2013) also obtained an approximation 
of discrete submodular functions by juntas. They prove that every submodular function $f$ of range of size $k$ 
is $\epsilon$-close to a function of $(k \log(k/\epsilon))^{O(k)}$ variables and give an algorithm for testing submodularity using 
$(k \log(1/\epsilon))^{\tilde{O}(k)}$ value queries. Note that our bound has a better dependence on $k$ but worse on $\epsilon$ (the bounds 
have the same order when $\epsilon = k^{-k}$). 
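The comparison at this value of $\epsilon$ can be verified by direct substitution. The sketch below assumes $k \geq 2$ and uses only the coarse form of our junta bound, $2^{O(k)}/\epsilon^{O(1)}$, which follows from a decision tree of depth $O(k + \log(1/\epsilon))$; exact constants are not needed.

```latex
\frac{2^{O(k)}}{\epsilon^{O(1)}} \bigg|_{\epsilon = k^{-k}}
  = 2^{O(k)} \cdot k^{O(k)} = k^{O(k)},
\qquad
\bigl(k \log(k/\epsilon)\bigr)^{O(k)} \bigg|_{\epsilon = k^{-k}}
  = \bigl(k (k+1) \log k\bigr)^{O(k)} = k^{O(k)} .
```

For larger $\epsilon$ the first expression is smaller as a function of $k$, and for smaller $\epsilon$ the second is smaller, since it depends on $1/\epsilon$ only through a logarithm.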

As in the general case, these structural results can be used to obtain learning algorithms in this setting. 
It is natural to require that learning algorithms in this setting output a $\{0, 1, \ldots, k\}$-valued hypothesis. We 
observe that the algorithm in Theorem 5.4 can be easily modified to return a $\{0, 1/k, \ldots, 1\}$-valued function 
when it is applied for learning $\{0, 1/k, \ldots, 1\}$-valued functions. This is true since the proof of Theorem 5.4 
(see Section A) discretizes the target function and reduces the problem to learning of Boolean functions. 
$\{0, 1/k, \ldots, 1\}$-valued functions are already discretized. With this exact discretization the output of the 
agnostic algorithm is a sum of $k$ Boolean hypotheses (scaled by $1/k$), and in particular is a $\{0, 1/k, \ldots, 1\}$-valued function. 
This immediately leads to the following algorithm for agnostic learning of $\{0, 1, \ldots, k\}$-valued submodular 
functions. 

Theorem B.2. Let $\mathcal{C}_s^k$ denote the class of all submodular functions from $\{0,1\}^n$ to $\{0, 1, \ldots, k\}$. There 
exists an algorithm $\mathcal{A}$ that given $\epsilon > 0$ and access to value queries of any $f : \{0,1\}^n \to \{0, 1, \ldots, k\}$, with 
probability at least 2/3, outputs a function $h$ with the range in $\{0, 1, \ldots, k\}$, such that $\mathbf{E}_U[|f - h|] \leq \Delta + \epsilon$, 
where $\Delta = \min_{g \in \mathcal{C}_s^k}\{\mathbf{E}_U[|f - g|]\}$. Further, $\mathcal{A}$ runs in time poly$(n, 2^k, 1/\epsilon)$ and uses poly$(\log n, 2^k, 1/\epsilon)$ 
value queries. 

This improves on the poly$(n) \cdot k^{O(k \log k/\epsilon)}$-time and queries algorithm with the same guarantees which is 
implied by the spectral bounds in (Raskhodnikova and Yaroslavtsev, 2013). We remark that the guarantee 
of this algorithm implies PAC learning with disagreement error (since for integer-valued hypotheses $\ell_1$-error 
upper-bounds the disagreement error). At the same time the guarantee is not agnostic in terms of the 
disagreement error* (but only for $\ell_1$-error). 

The structural results also imply that our PAC learning algorithm in Theorem 1.5, when adapted to this 
setting, yields the following PAC learning algorithm. 

Theorem B.3. There exists an algorithm $\mathcal{A}$ that given $\epsilon > 0$ and access to random uniform examples of any 
$f \in \mathcal{C}_s^k$, with probability at least 2/3, outputs a function $h$, such that $\Pr_U[f \neq h] \leq \epsilon$. Further, $\mathcal{A}$ runs in 
time $\tilde{O}(n^2) \cdot 2^{O(k^2 + \log^2(1/\epsilon))}$ and uses $2^{O(k^2 + \log^2(1/\epsilon))} \log n$ examples. 

For learning from random examples alone, previous structural results imply only substantially weaker 
bounds (poly$(n^k, 1/\epsilon)$ in (Raskhodnikova and Yaroslavtsev, 2013)). 



Finally, we show that the combination of approximation by a junta and exact representation by a decision 
tree leads to a proper PAC learning algorithm for pseudo-Boolean submodular functions in time poly$(n) \cdot 
2^{O(k^2 + k \log(1/\epsilon))}$ using value queries. Note that, for general submodular functions our results imply only 
a doubly-exponential time algorithm (with singly exponential number of random examples). 

Theorem B.4. Let $\mathcal{C}_s^k$ denote the class of all submodular functions from $\{0,1\}^n$ to $\{0, 1, \ldots, k\}$. There 
exists an algorithm $\mathcal{A}$ that given $\epsilon > 0$ and access to value queries of any $f \in \mathcal{C}_s^k$, with probability at least 2/3, 
outputs a submodular function $h$, such that $\Pr[f \neq h] \leq \epsilon$. Further, $\mathcal{A}$ runs in time poly$(n, 2^{k^2 + k \log(1/\epsilon)})$ 
and uses poly$(\log n, 2^{k^2 + k \log(1/\epsilon)})$ value queries. 

Proof Outline: In the first step we identify a small set of variables $J$ such that there exists a function that 
depends only on variables indexed by $J$ and is $\epsilon/3$-close to $f$. This can be achieved (with probability at 
least 2/3) by using the algorithm in Lemma 1.4 (with bounds adapted to this setting) to obtain a set of 
size poly$(2^k/\epsilon)$. Now let $U_J$ represent the uniform distribution over $\{0,1\}^J$ and $U_{\bar{J}}$ represent the uniform 
distribution over $\{0,1\}^{\bar{J}}$, where $\bar{J} = [n] \setminus J$. Let $g$ be the function that depends only on variables in $J$ and is 
$\epsilon/3$-close to $f$. Then, 

$$\Pr_{x \sim U}[f(x) \neq g(x)] = \mathbf{E}_{z \sim U_{\bar{J}}}\Bigl[\Pr_{y \sim U_J}[f(y,z) \neq g(y,0)]\Bigr] \leq \epsilon/3.$$

By Markov's inequality, this means that with probability at least 1/2 over the choice of $z$ from $\{0,1\}^{\bar{J}}$, 
$\Pr_{y \sim U_J}[f(y,z) \neq g(y,0)] \leq 2\epsilon/3$ and hence $\Pr_{y \sim U_J, w \sim U_{\bar{J}}}[f(y,z) \neq f(y,w)] \leq \epsilon$. In other words, a 
random restriction of variables outside of $J$ gives, with probability at least 1/2, a function that is $\epsilon$-close 
to $f$. As before we observe that a restriction of a submodular function is a submodular function itself. We 
therefore can choose $z$ randomly and then run the decision tree representation construction algorithm 
described in the proof of Theorem 3.1 on $f(y,z)$ as a function of $y$. It is easy to see that the running time of 
the algorithm is essentially determined by the size of the tree. A tree of rank $2k$ over $|J|$ variables has size 
of at most $|J|^{2k}$ (Ehrenfeucht and Haussler, 1989). Therefore with probability at least $2/3 \cdot 1/2 = 1/3$, 
in time poly$(n, 2^{k^2 + k \log(1/\epsilon)})$ and using poly$(\log n, 2^{k^2 + k \log(1/\epsilon)})$ value queries we will obtain a submodular 
function which is $\epsilon$-close to $f$. As usual the probability of success can be easily boosted to 2/3 by repeating 
the algorithm 3 times and testing the hypothesis. □ 

* In (Raskhodnikova and Yaroslavtsev, 2013) it was mistakenly claimed that the application of the algorithm of Gopalan et al. 
(2008) gives agnostic guarantee for the disagreement error. 
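The size bound for low-rank trees used in the outline can be explored numerically. The sketch below is an illustration under our own conventions, not the construction from the cited work: it counts leaves of a reduced tree (no variable repeated on a path) of rank at most $r$ over $m$ variables via the recurrence that at most one child of the root can have full rank $r$, and compares the count with the convenient polynomial bound $(m+1)^r$, which is our own choice of comparison.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def max_leaves(m, r):
    """Upper bound on the leaves of a reduced decision tree of rank <= r over m variables."""
    if m == 0 or r == 0:
        return 1  # a single leaf
    # One subtree may have rank r; the other must then have rank <= r - 1.
    return max_leaves(m - 1, r) + max_leaves(m - 1, r - 1)

def binomial_sum(m, r):
    """Closed form of the recurrence: sum_{i=0}^{r} C(m, i)."""
    return sum(comb(m, i) for i in range(r + 1))
```

With $r = 2k$ and $m = |J| = \mathrm{poly}(2^k/\epsilon)$ this gives a tree of size $2^{O(k^2 + k\log(1/\epsilon))}$, matching the running time stated in Theorem B.4.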



C Proof of Lemma 6.3 



Since the functions we are dealing with are going to be symmetric, we make the convenient definition of 
the weight of any $x \in \{0,1\}^n$. For any $x \in \{0,1\}^n$, the weight of $x$ over a subset $S \subseteq [n]$ of coordinates is 
defined as $w_S(x) = \sum_{i \in S} x_i$. 

Our correlation bounds for monotone symmetric submodular functions will depend on the following 
well-known observation which we state without proof. 

Fact C.1 (Symmetric Submodular Functions from Concave Profiles). Let $p : \{0, 1, \ldots, n\} \to [0, 1]$ be any 
function such that, 

$$\forall\, 0 \leq i \leq n - 2, \quad p(i + 1) - p(i) \geq p(i + 2) - p(i + 1).$$

Let $f_p : \{0,1\}^n \to [0, 1]$ be a symmetric function such that $f_p(x) = p(w_{[n]}(x))$. Then $f_p$ is submodular. 
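Fact C.1 can be sanity-checked by brute force on a small cube. The following sketch tests the lattice form of submodularity, $f(x \vee y) + f(x \wedge y) \leq f(x) + f(y)$, for a symmetric function built from a profile; the two profiles are hypothetical examples chosen for illustration (one concave, one not).

```python
from fractions import Fraction
from itertools import product

def is_submodular(f, n):
    """Check f(x OR y) + f(x AND y) <= f(x) + f(y) for all x, y in {0,1}^n."""
    pts = list(product((0, 1), repeat=n))
    for x in pts:
        for y in pts:
            join = tuple(a | b for a, b in zip(x, y))
            meet = tuple(a & b for a, b in zip(x, y))
            if f(join) + f(meet) > f(x) + f(y):
                return False
    return True

def symmetric_from_profile(p):
    """f_p(x) = p(w(x)), where w(x) is the Hamming weight of x."""
    return lambda x: p[sum(x)]

# Increments 0.5, 0.3, 0.1, 0.0 are non-increasing: a concave profile.
p_concave = [Fraction(v, 10) for v in (0, 5, 8, 9, 9)]
# Increments 0.1, 0.4, ... increase at the start: not concave.
p_not_concave = [Fraction(v, 10) for v in (0, 1, 5, 6, 6)]
```

The check is exhaustive, so it is only practical for small $n$; it is meant as a numerical illustration and does not replace the proof.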

Remark C.2. Observe that for any submodular function $f : \{0,1\}^n \to [0,1]$, the correlation with the 
parity $\chi_S$ depends only on the profile of $f$, $p_f : \{0, 1, \ldots, n\} \to [0, 1]$, 

$$p_f(i) = \mathbf{E}_{x : w_S(x) = i}[f(x)].$$

That is, if $\tilde{f} : \{0,1\}^n \to [0, 1]$ is defined by $\tilde{f}(x) = p_f(w_S(x))$ for every $x \in \{0,1\}^n$, then $\langle f, \chi_S \rangle = 
\langle \tilde{f}, \chi_S \rangle$. Thus for finding submodular functions with large correlation with a given parity, it is enough to 
focus on symmetric submodular functions. 
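The invariance of the correlation under symmetrization can be illustrated directly. In the sketch below the profile is taken to be the average of $f$ over inputs of each weight on $S$ (our reading of the profile in Remark C.2), and the example function `f` is an arbitrary hypothetical $[0,1]$-valued function, not a submodular one; the identity holds for any $f$.

```python
from fractions import Fraction
from itertools import product

def chi(S, x):
    """Parity chi_S(x) = (-1)^{w_S(x)}."""
    return (-1) ** sum(x[i] for i in S)

def correlation(f, S, n):
    """<f, chi_S> under the uniform distribution on {0,1}^n."""
    pts = list(product((0, 1), repeat=n))
    return sum(f(x) * chi(S, x) for x in pts) / Fraction(len(pts))

def symmetrize(f, S, n):
    """Replace f by the average of f over all inputs with the same weight on S."""
    totals, counts = {}, {}
    for x in product((0, 1), repeat=n):
        w = sum(x[i] for i in S)
        totals[w] = totals.get(w, 0) + f(x)
        counts[w] = counts.get(w, 0) + 1
    profile = {w: Fraction(totals[w]) / counts[w] for w in totals}
    return lambda x: profile[sum(x[i] for i in S)]

# Hypothetical example on {0,1}^4 with S = {0, 1, 2}.
n, S = 4, (0, 1, 2)
f = lambda x: Fraction(x[0] * x[1] * x[2] + x[3], 2)
f_sym = symmetrize(f, S, n)
```

Both correlations group the inputs by $w_S(x)$, on which $\chi_S$ is constant, so only the per-weight averages of $f$ matter.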

We will need the following well-known formula for the partial sum of binomial coefficients in our 
correlation bounds. 

Fact C.3 (Alternating Binomial Partial Sum). For every $n, r \in \mathbb{N}$, 

$$\sum_{j=0}^{r} \binom{n}{j} (-1)^j = (-1)^r \binom{n-1}{r}.$$
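The standard alternating partial-sum identity, $\sum_{j=0}^{r} (-1)^j \binom{n}{j} = (-1)^r \binom{n-1}{r}$ for $n \geq 1$, which is how we read Fact C.3, is easy to confirm numerically. A minimal check (function names are our own):

```python
from math import comb

def alternating_partial_sum(n, r):
    """Left-hand side: sum_{j=0}^{r} (-1)^j * C(n, j)."""
    return sum((-1) ** j * comb(n, j) for j in range(r + 1))

def closed_form(n, r):
    """Right-hand side: (-1)^r * C(n - 1, r)."""
    return (-1) ** r * comb(n - 1, r)
```

For $r \geq n$ both sides vanish, which recovers the familiar fact that the full alternating sum of binomial coefficients is zero.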



Proof of Lemma 6.3. Notice that the parity on any subset $S \subseteq [n]$ of variables at any input $x \in \{0,1\}^n$ is 
computed by $\chi_S(x) = (-1)^{w_S(x)}$. We will now define a symmetric submodular function $R_S : \{0,1\}^S \to 
[0, 1]$ and then modify it to construct a monotone symmetric submodular function $H_S : \{0,1\}^S \to [0, 1]$ 
that has the required correlation with the associated parity $\chi_S$. It is easy to verify that the natural extension 
of $R_S$ and $H_S$ to $\{0,1\}^n$ (from $\{0,1\}^S$), that just ignores all the coordinates outside $S$, is submodular and 
thus it is enough to construct functions on $\{0,1\}^S$. 

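The definition of $R_S$ will vary based on the cardinality $s = |S|$ of $S$. If $S$ is such that $s = 2k$ for some $k \in \mathbb{N}$, 
let $R_S$ for each $S \subseteq [n]$ be defined as follows: 

$$R_S(x) = \begin{cases} \dfrac{w_S(x)}{k} & w_S(x) \leq k \\[2mm] 1 - \dfrac{w_S(x) - k}{k} & w_S(x) > k \end{cases}$$

On the other hand, if $S$ is such that $s = 2k - 1$ for some $k \in \mathbb{N}$, define: 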






Notice that with this definition, $R_S : \{0,1\}^n \to [0, 1]$ and has its maximum value exactly equal to 1. 
Further, since $R_S$ can be seen to be defined by a concave profile, Fact C.1 guarantees that $R_S$ is submodular. 
We will now compute the correlation of $\chi_S$ with $R_S$. We will first deal with the case when $|S|$ is even. 

Let $s = 2k$ for some $k \in \mathbb{N}$. 



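$$\langle R_S, \chi_S \rangle = \frac{1}{2^{2k}} \sum_{x \in \{0,1\}^{2k}} R_S(x)\chi_S(x)
= \frac{1}{2^{2k}} \left[ \sum_{i=0}^{k} \binom{2k}{i} (-1)^i \frac{i}{k} + \sum_{i=k+1}^{2k} \binom{2k}{i} (-1)^i \left(1 - \frac{i-k}{k}\right) \right]$$

Substituting $j = 2k - i$ in the second sum, 

$$= \frac{1}{2^{2k}} \left[ \sum_{i=0}^{k} \binom{2k}{i} (-1)^i \frac{i}{k} + \sum_{j=0}^{k-1} \binom{2k}{j} (-1)^j \frac{j}{k} \right]
= \frac{1}{2^{2k}} \left( 2 \sum_{i=0}^{k-1} \binom{2k}{i} (-1)^i \frac{i}{k} + (-1)^k \binom{2k}{k} \right)$$

Using the partial sum formula from Fact C.3 (together with the identity $i\binom{2k}{i} = 2k\binom{2k-1}{i-1}$) gives: 

$$\langle R_S, \chi_S \rangle = (-1)^k \cdot \frac{2}{2^{2k}} \cdot \frac{1}{2k-1} \binom{2k-1}{k}.$$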
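The even case can be checked numerically. The sketch below assumes the tent-shaped profile $R_S(x) = w_S(x)/k$ for $w_S(x) \leq k$ and $2 - w_S(x)/k$ otherwise, and the closed form $(-1)^k \frac{2}{2^{2k}} \frac{1}{2k-1}\binom{2k-1}{k}$; both are our reading of displays that are hard to recover from the source, so the code should be taken as a consistency check of that reading and not as a confirmation of the original formulas.

```python
from fractions import Fraction
from math import comb

def tent_profile(k):
    """Assumed profile of R_S for s = 2k: w/k up to weight k, then 2 - w/k."""
    return [Fraction(w, k) if w <= k else 2 - Fraction(w, k) for w in range(2 * k + 1)]

def parity_correlation(profile):
    """<f, chi_S> for a symmetric f on s = len(profile) - 1 variables."""
    s = len(profile) - 1
    total = sum(comb(s, w) * (-1) ** w * profile[w] for w in range(s + 1))
    return Fraction(total) / 2 ** s

def closed_form(k):
    """(-1)^k * 2 / 2^(2k) * C(2k - 1, k) / (2k - 1)."""
    return Fraction((-1) ** k * 2 * comb(2 * k - 1, k), 2 ** (2 * k) * (2 * k - 1))
```

Since $\binom{2k-1}{k} = \Theta(2^{2k}/\sqrt{k})$, the closed form has magnitude $\Theta(k^{-3/2})$, in line with the bound claimed for $|\langle R_S, \chi_S \rangle|$.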
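Now suppose $s = 2k - 1$ for some $k \in \mathbb{N}$. Then 

$$\langle R_S, \chi_S \rangle = \frac{1}{2^{2k-1}} \sum_{x \in \{0,1\}^{2k-1}} R_S(x)\chi_S(x),$$

and, as in the even case, the sum is grouped by weight and the upper half is re-indexed by 
substituting $j = 2k - 1 - i$. [The intermediate displays of this computation are not legible in the source.] 

Again, using the partial sum formula from Fact C.3 gives a closed form for $\langle R_S, \chi_S \rangle$. 
[The resulting display is not legible in the source.] 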

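In either case, we now obtain that $|\langle R_S, \chi_S \rangle| = \Omega(k^{-3/2}) = \Omega(s^{-3/2})$. 

For the remaining part of the proof, we need to define the function $H_S$. We obtain $H_S$ by a natural 
"monotonization" of $R_S$. Thus, if $s = 2k$, let $H_S$ be defined as: 

$$H_S(x) = \begin{cases} \dfrac{w_S(x)}{k} & w_S(x) \leq k \\[2mm] 1 & w_S(x) > k \end{cases}$$

On the other hand, if $S$ is such that $s = 2k - 1$ for some $k \in \mathbb{N}$, $H_S$ is defined by the same 
monotonization applied to the odd-case $R_S$, taking the value 1 on inputs of large weight. 

Notice again that $H_S : \{0,1\}^n \to [0, 1]$ and $H_S$ is submodular by Fact C.1. To obtain a lower bound on 
$|\langle \chi_S, H_S \rangle|$, $H_S$ can be seen as the average of a monotone linear function and $R_S$, that is, if $s = 2k$, $\forall x$, 
$H_S(x) = \frac{1}{2}\left(R_S(x) + \frac{w_S(x)}{k}\right)$, and if $s = 2k - 1$, $\forall x$, $H_S(x) = \frac{1}{2}\left(R_S(x) + \ell(x)\right)$ for a monotone 
linear function $\ell$ of $w_S(x)$. It is now easy to obtain a lower bound on the correlation of $\chi_S$ with $H_S$. 
For $s = 2k$, 

$$\langle \chi_S, H_S \rangle = \frac{1}{2}\langle \chi_S, R_S \rangle + \frac{1}{2}\left\langle \chi_S, \frac{w_S}{k} \right\rangle.$$

For $s = 2k - 1$, 

$$\langle \chi_S, H_S \rangle = \frac{1}{2}\langle \chi_S, R_S \rangle + \frac{1}{2}\langle \chi_S, \ell \rangle.$$

Finally, observe that for any $s = |S| \geq 2$, 
$\langle \chi_S, w_S \rangle = \frac{1}{2^s}\sum_{t=0}^{s} \binom{s}{t} (-1)^t \cdot t = \frac{s}{2^s} \sum_{t=1}^{s} \binom{s-1}{t-1} (-1)^t = 0$. This 
immediately yields the required correlation: $|\langle \chi_S, H_S \rangle| = \frac{1}{2}|\langle \chi_S, R_S \rangle| = \Omega(s^{-3/2})$. 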
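The last step can also be checked numerically for even $s$. The sketch below assumes the profiles $R_S = w/k$ for $w \leq k$ and $2 - w/k$ otherwise, and $H_S = \min(w, k)/k$ (our reading of the even-case definitions); it verifies that the weight function has zero correlation with the parity and that the correlation of $H_S$ is exactly half that of $R_S$.

```python
from fractions import Fraction
from math import comb

def parity_correlation(profile):
    """<f, chi_S> for a symmetric f on s = len(profile) - 1 variables."""
    s = len(profile) - 1
    total = sum(comb(s, w) * (-1) ** w * profile[w] for w in range(s + 1))
    return Fraction(total) / 2 ** s

def r_profile(k):
    """Assumed profile of R_S for s = 2k (tent shape with peak 1 at weight k)."""
    return [Fraction(w, k) if w <= k else 2 - Fraction(w, k) for w in range(2 * k + 1)]

def h_profile(k):
    """Assumed profile of H_S for s = 2k: the monotonization min(w, k) / k."""
    return [Fraction(min(w, k), k) for w in range(2 * k + 1)]

def weight_profile(s):
    """Profile of the (unnormalized) weight function w_S."""
    return [Fraction(w) for w in range(s + 1)]
```

The factor of one half is exact here because $H_S$ is the pointwise average of $R_S$ and $w_S/k$, and the second term contributes nothing to the correlation.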

□ 